====== Neon ======

===== Bayer RAW10 packed to Y16 =====

Data is organised in 5-byte chunks, each storing 4 RAW10 pixels:
  * 4 bytes holding the 8 MSBs of each 10-bit pixel
  * 1 byte holding the 4 x 2-bit LSBs

General idea:
  * Load data
    * Neon can perform interleaved loads by 4
    * Neon can use table lookups, so we can specify which byte is loaded where
  * Extend 8-bit vectors to 16-bit vectors
    * Max register size? 128 bits, so 16 x 8-bit or 8 x 16-bit. Load only 8 x 8-bit and extend to 8 x 16-bit? Or split using vget_low_s8()?
    * With shift left by 2?
  * How to insert the 2 LSB bits?
    * Create 3 more copies of the byte with the LSB bits
    * In each copy, move the bits to the correct place
    * Insert the bits into the 16-bit value with OR. Or is there an instruction to shift and insert?
  * Store data
    * Interleaved store by 3

Useful info on ARM:
  * [[https://developer.arm.com/documentation/101964/0300/libTIFF-optimization--CMYK-to-RGBA-conversion|libTIFF optimization: CMYK to RGBA conversion]]
  * [[https://developer.arm.com/documentation/101964/0300/Chromium-optimization--pre-multiplied-alpha-channel-data|Chromium optimization: pre-multiplied alpha channel data]]
  * [[https://developer.arm.com/documentation/102159/0400/Shifting-left-and-right?lang=en]]
  * [[https://developer.arm.com/documentation/102159/0400/Load-and-store---example-RGB-conversion?lang=en|Load and store - example RGB conversion]]
  * [[https://developer.arm.com/documentation/102107a/0100/RGB-to-grayscale-conversion|RGB to grayscale conversion]]
  * [[https://developer.arm.com/documentation/den0018/a/NEON-Code-Examples-with-Optimization/Converting-color-depth/Converting-from-RGB565-to-RGB888]]
  * [[https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/coding-for-neon---part-4-shifting-left-and-right|Coding for Neon - Part 4: Shifting Left and Right]]

Useful info:
  * [[https://stackoverflow.com/questions/57363580/neon-unpacking-int8x16-t-into-a-pair-of-int16x8-packing-a-pair-of-int16x8-t-i|Unpacking int8x16_t into a pair of int16x8 & packing a pair of int16x8_t into a int8x16_t]]
  * [[https://stackoverflow.com/questions/44353277/how-to-code-ai-bci-on-arm-neon-simd-intrinsic-function]]
  * [[https://github.com/yszheda/rgb2yuv-neon/blob/master/yuv444.cpp#L105C21-L105C30|rgb2yuv-neon]]
  * [[http://0x80.pl/notesen/2017-01-07-base64-simd-neon.html|ARM Neon and Base64 encoding & decoding]]
  * [[https://lemire.me/blog/2017/07/10/pruning-spaces-faster-on-arm-processors-with-vector-table-lookups/|Pruning spaces faster on ARM processors with Vector Table Lookups]]
  * [[https://en.eeworld.com.cn/news/mcu/eic309253.html|ARM processor NEON programming and optimization techniques - shift operations such as left shift and right shift]]

From [[https://stackoverflow.com/questions/71554911/how-to-vectorize-2d-array-using-neon-intrinsics]]:
  * Consider using OpenMP: add ''#pragma omp parallel for'' before the for loop and ''-fopenmp'' to the compiler command line

Auto vectorization in GCC:
  * [[https://developers.redhat.com/articles/2023/12/08/vectorization-optimization-gcc#|Vectorization optimization in GCC]]

Instead of using ARM Neon directly, use the OpenCV wrapper, which provides portability across platforms:
  * [[https://answers.opencv.org/question/224485/simd-optimizations-get-no-performance-gains-on-arm-neon/]]
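
A possible sketch of the "General idea" steps for RAW10 to 16-bit unpacking, not a tested or tuned implementation. Assumptions that are not stated in the notes above:
  * Pixel i of each group keeps its two LSBs in bits (2i+1):(2i) of the fifth byte (the MIPI CSI-2 style layout).
  * The output is the plain 10-bit value (MSB << 2 | LSB) in a 16-bit container, not scaled to the full 16-bit range.
  * The pixel count is a multiple of 4.
The function names are hypothetical. The Neon path handles 8 pixels (10 bytes) per iteration: a table lookup (''vtbl2_u8'') picks the MSB bytes and replicates the LSB byte, a per-lane right shift (''vshl_u8'' with negative counts) aligns each 2-bit field, and shift-left-and-insert (''vsliq_n_u16'') merges both in one step. Because the 5-byte groups do not fit the 2/3/4-way interleaved loads, the sketch uses plain loads and a plain store. The scalar function serves as a reference to check the Neon path against.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference: 5 packed bytes -> 4 pixels, value = (MSB << 2) | LSB.
 * Assumed layout: pixel k's two LSBs are bits (2k+1):(2k) of byte 4. */
static void raw10_to_y16_scalar(const uint8_t *src, uint16_t *dst, size_t n_pixels)
{
    for (size_t i = 0; i + 4 <= n_pixels; i += 4, src += 5, dst += 4) {
        for (int k = 0; k < 4; k++)
            dst[k] = (uint16_t)((src[k] << 2) | ((src[4] >> (2 * k)) & 0x3));
    }
}

#if defined(__ARM_NEON)
/* Neon sketch: 10 packed bytes -> 8 pixels per iteration. */
static void raw10_to_y16_neon(const uint8_t *src, uint16_t *dst, size_t n_pixels)
{
    /* Lookup table = bytes 0..7 followed by bytes 2..9, so input byte 8 is
     * table index 14 and input byte 9 is table index 15. Loading this way
     * never reads past the 10 input bytes of the iteration. */
    static const uint8_t msb_idx[8] = {0, 1, 2, 3, 5, 6, 7, 14};
    static const uint8_t lsb_idx[8] = {4, 4, 4, 4, 15, 15, 15, 15};
    static const int8_t  lsb_shr[8] = {0, -2, -4, -6, 0, -2, -4, -6};
    const uint8x8_t vmsb_idx = vld1_u8(msb_idx);
    const uint8x8_t vlsb_idx = vld1_u8(lsb_idx);
    const int8x8_t  vlsb_shr = vld1_s8(lsb_shr);

    size_t i = 0;
    for (; i + 8 <= n_pixels; i += 8, src += 10, dst += 8) {
        uint8x8x2_t tbl;
        tbl.val[0] = vld1_u8(src);      /* input bytes 0..7 */
        tbl.val[1] = vld1_u8(src + 2);  /* input bytes 2..9 */
        uint8x8_t msb = vtbl2_u8(tbl, vmsb_idx);  /* 8 MSB bytes */
        uint8x8_t lsb = vtbl2_u8(tbl, vlsb_idx);  /* LSB byte, 4 copies each */
        lsb = vshl_u8(lsb, vlsb_shr);   /* negative count shifts right */
        /* SLI: result = (msb << 2) | (lsb & 0x3); bits of lsb above
         * bit 1 are overwritten, so no separate AND mask is needed. */
        uint16x8_t px = vsliq_n_u16(vmovl_u8(lsb), vmovl_u8(msb), 2);
        vst1q_u16(dst, px);
    }
    raw10_to_y16_scalar(src, dst, n_pixels - i);  /* 0 or 4 pixels left */
}
#endif

/* Dispatcher: uses the Neon path when the compiler targets Neon. */
void raw10_to_y16(const uint8_t *src, uint16_t *dst, size_t n_pixels)
{
#if defined(__ARM_NEON)
    raw10_to_y16_neon(src, dst, n_pixels);
#else
    raw10_to_y16_scalar(src, dst, n_pixels);
#endif
}
```

For example, under the assumed layout the packed bytes ''FF 00 55 AA 93'' unpack to the pixel values 1023, 0, 341 and 682. If the sensor orders the LSB fields differently, only ''lsb_shr'' (and the scalar shift) would need to change.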