Neon
Bayer RAW10 packed to Y16
Data is organised in 5-byte chunks, each storing 4 RAW10 pixels:
- 4 bytes, each holding the 8 MSBs of one 10-bit pixel
- 1 byte holding the 2 LSBs of each of the 4 pixels
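A minimal scalar sketch of this layout, useful as a reference to check any Neon version against. The function names are made up for illustration. It assumes that the two LSBs of pixel k sit in bits [2k+1:2k] of the fifth byte, and that the 10-bit value ends up in the low bits of the 16-bit output; both should be checked against the actual sensor format.

```c
#include <stddef.h>
#include <stdint.h>

/* Unpack one 5-byte group into four 10-bit pixels, each stored in the
 * low 10 bits of a uint16_t.
 * Assumption: the LSBs of pixel k are bits [2k+1:2k] of src[4]. */
static void raw10_unpack_group(const uint8_t src[5], uint16_t dst[4])
{
    for (int k = 0; k < 4; ++k) {
        uint16_t msb = src[k];                      /* upper 8 bits */
        uint16_t lsb = (src[4] >> (2 * k)) & 0x3u;  /* lower 2 bits */
        dst[k] = (uint16_t)((msb << 2) | lsb);
    }
}

/* Unpack n_pixels (assumed to be a multiple of 4) from a packed buffer. */
static void raw10_unpack(const uint8_t *src, uint16_t *dst, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels / 4; ++i)
        raw10_unpack_group(src + 5 * i, dst + 4 * i);
}
```

For example, the group 0xFF 0x00 0x80 0x01 0xE4 unpacks to 1020, 1, 514, 7 under these assumptions.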
General idea:
- Load data
  - Neon can perform interleaved loads by 4 (vld4)
  - Neon has table lookup (vtbl, or vqtbl1q_u8 on AArch64), so we can specify which byte goes where
- Extend 8-bit vectors to 16-bit vectors
  - Maximum register size is 128 bits, so 16 x 8bit or 8 x 16bit. Either load only 8 x 8bit and extend to 8 x 16bit, or load 16 x 8bit and split with vget_low_u8() / vget_high_u8()
  - Shift left by 2 to make room for the LSBs (vshll_n_u8 widens and shifts in one step)
- How to insert the 2 LSB bits?
  - Create 3 more copies of the byte with the LSB bits
  - In each copy, move the bits to the correct place
  - Insert the bits into the 16-bit values with OR, or use the shift-left-and-insert instruction (VSLI, e.g. vsliq_n_u16)
- Store data
  - Interleaved store by 3
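One possible way to put these steps together, as an untested-on-hardware sketch rather than a tuned implementation. It handles 8 pixels (10 input bytes) per iteration: vtbl2_u8 picks out the MSB bytes and replicates the LSB bytes, vshl_u8 with negative counts shifts each copy right so its two bits land in bits [1:0], and vsliq_n_u16 does the shift-by-2 and insert in one step. Because the table lookup already puts pixels in order, this sketch uses a plain vst1q_u16 store instead of an interleaved one. The function name raw10_to_y16 is made up, and the LSB bit order (pixel k in bits [2k+1:2k]) is an assumption. On non-Arm targets only the scalar fallback is compiled.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Unpack n_pixels (assumed multiple of 4) of packed RAW10 into uint16_t. */
static void raw10_to_y16(const uint8_t *src, uint16_t *dst, size_t n_pixels)
{
    size_t i = 0; /* pixel index */
#if defined(__ARM_NEON)
    /* Byte positions inside two consecutive 5-byte groups. */
    static const uint8_t idx_msb_tab[8] = {0, 1, 2, 3, 5, 6, 7, 8};
    static const uint8_t idx_lsb_tab[8] = {4, 4, 4, 4, 9, 9, 9, 9};
    /* Negative counts make vshl_u8 shift right. */
    static const int8_t shift_tab[8] = {0, -2, -4, -6, 0, -2, -4, -6};
    const uint8x8_t idx_msb = vld1_u8(idx_msb_tab);
    const uint8x8_t idx_lsb = vld1_u8(idx_lsb_tab);
    const int8x8_t shifts = vld1_s8(shift_tab);
    const size_t n_bytes = n_pixels / 4 * 5;

    /* Only 10 bytes are consumed per iteration, but vld1q_u8 reads 16,
     * so the vector loop stops while 16 bytes are still available. */
    while (i + 8 <= n_pixels && i / 4 * 5 + 16 <= n_bytes) {
        uint8x16_t in = vld1q_u8(src + i / 4 * 5);
        uint8x8x2_t tbl;
        tbl.val[0] = vget_low_u8(in);
        tbl.val[1] = vget_high_u8(in);
        uint8x8_t msb = vtbl2_u8(tbl, idx_msb);  /* 8 MSB bytes */
        uint8x8_t lsb = vtbl2_u8(tbl, idx_lsb);  /* 4 copies of each LSB byte */
        lsb = vshl_u8(lsb, shifts);              /* wanted bits now in [1:0] */
        /* Result = (msb << 2) | (lsb & 3): SLI keeps only the low 2 bits
         * of its first operand, so no separate mask is needed. */
        uint16x8_t out = vsliq_n_u16(vmovl_u8(lsb), vmovl_u8(msb), 2);
        vst1q_u16(dst + i, out);
        i += 8;
    }
#endif
    /* Scalar tail, and the whole job when Neon is not available. */
    for (; i + 4 <= n_pixels; i += 4) {
        const uint8_t *g = src + i / 4 * 5;
        for (int k = 0; k < 4; ++k)
            dst[i + k] = (uint16_t)((g[k] << 2) | ((g[4] >> (2 * k)) & 0x3));
    }
}
```

Processing 16 pixels per iteration with vqtbl1q_u8 on AArch64 would likely be faster, but needs 20 input bytes per step and therefore two loads or a wider table.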
Useful info on ARM:
Useful info:
From https://stackoverflow.com/questions/71554911/how-to-vectorize-2d-array-using-neon-intrinsics:
Consider using OpenMP: add #pragma omp parallel for before the for loop and pass -fopenmp on the compiler command line
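Applied to the RAW10 case, that suggestion might look like the sketch below: rows are independent, so the row loop is the natural place for the pragma. The function name, the stride parameter and the LSB bit order are assumptions for illustration. Without -fopenmp the pragma is ignored and the loop runs serially.

```c
#include <stddef.h>
#include <stdint.h>

/* Unpack a packed RAW10 frame row by row.
 * src_stride is the number of bytes per packed row (hypothetical parameter);
 * width is in pixels and assumed to be a multiple of 4. */
static void raw10_frame_to_y16(const uint8_t *src, size_t src_stride,
                               uint16_t *dst, int width, int height)
{
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        const uint8_t *s = src + (size_t)y * src_stride;
        uint16_t *d = dst + (size_t)y * (size_t)width;
        for (int x = 0; x + 4 <= width; x += 4, s += 5) {
            /* Assumption: LSBs of pixel k are bits [2k+1:2k] of s[4]. */
            for (int k = 0; k < 4; ++k)
                d[x + k] = (uint16_t)((s[k] << 2) | ((s[4] >> (2 * k)) & 0x3));
        }
    }
}
```

Threading and Neon are complementary: the inner loop can still be replaced by a vectorized version.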
Auto vectorization in GCC:
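As a small illustration of the kind of loop the auto-vectorizer tends to handle, here is the "extend to 16 bit and shift left by 2" step written as plain C. Whether GCC actually vectorizes it depends on the compiler version and target; building with -O3 and -fopt-info-vec makes GCC report the loops it vectorized. The function name is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen 8-bit MSBs to 16 bits and make room for the two LSBs.
 * restrict promises that src and dst do not overlap, which removes one
 * common obstacle to vectorization.
 * Try: gcc -O3 -fopt-info-vec -c file.c */
static void widen_shift2(const uint8_t *restrict src,
                         uint16_t *restrict dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint16_t)(src[i] << 2);
}
```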
Instead of using ARM Neon, use OpenCV wrapper which provides portability across platforms: