====== Neon ======

===== Bayer RAW10 packed to Y16 =====

Data is organised in packed 5-byte chunks, each holding four RAW10 pixels.
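
A minimal scalar sketch of that layout, useful as a reference when checking a vectorised version. It assumes the MIPI CSI-2 style byte order: bytes 0..3 hold the upper 8 bits of pixels 0..3, and byte 4 holds the four 2-bit remainders with pixel 0 in the lowest bits. This page does not spell out the order, so check it against the sensor documentation. The function name raw10_unpack_group is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack one 5-byte group into four 10-bit pixels (values 0..1023).
 * ASSUMED layout (MIPI CSI-2 style, not confirmed by this page):
 *   in[0..3] = bits 9..2 of pixels 0..3
 *   in[4]    = bits 1..0 of pixel 0 in its two lowest bits, then pixels 1, 2, 3 */
static void raw10_unpack_group(const uint8_t *in, uint16_t *out)
{
    assert(in != NULL && out != NULL);
    for (int i = 0; i < 4; i++) {
        uint16_t msb = (uint16_t)(in[i] << 2);          /* upper 8 bits, moved up by 2 */
        uint16_t lsb = (uint16_t)((in[4] >> (2 * i)) & 0x3); /* this pixel's 2 low bits */
        out[i] = (uint16_t)(msb | lsb);
    }
}
```

The result is the 10-bit value in the low bits of each 16-bit word; scaling it to the full 16-bit range would be a separate step.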

General idea:
  * Load data
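    * Neon can de-interleave on load in groups of 2, 3 or 4 (vld2 / vld3 / vld4); there is no by-5 variant that would match the 5-byte chunks
    * Neon has table lookup (vtbl / vqtbl), so we can specify which source byte goes to which lane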
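  * Extend 8-bit vectors to 16-bit vectors
    * Registers are at most 128 bits wide, so 16 x 8 bit or 8 x 16 bit. Either load only 8 x 8 bit and widen to 8 x 16 bit with vmovl_u8(), or load 16 x 8 bit and split with vget_low_u8() / vget_high_u8() (unsigned variants, since pixel data is unsigned)
    * vshll_n_u8() widens and shifts left by 2 in one step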
  * How to insert the 2 LSB bits?
    * Create 3 more copies of the byte holding the LSB bits
    * In each copy, shift the bits into the correct place
    * Merge the bits into the 16-bit values with OR, or use the shift-and-insert instructions (vsli_n / vsri_n)
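  * Store data
    * Interleaved store (vst4 if the four pixel positions are kept in separate vectors; a plain vst1 if the pixels are already in order)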
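
The steps above could be put together roughly as follows. This is a sketch under assumptions, not a tuned implementation: it assumes the MIPI CSI-2 style layout (bytes 0..3 are the upper 8 bits, byte 4 carries the 2-bit remainders with pixel 0 in the lowest bits), processes 8 pixels (10 bytes) per iteration using two overlapping 8-byte loads so nothing is read past the group, selects bytes with vtbl2_u8, and merges with vsliq_n_u16 (shift left and insert). It falls back to plain C when Neon is not available. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar path: one 5-byte group -> four 10-bit values (ASSUMED layout). */
static void raw10_group_scalar(const uint8_t *in, uint16_t *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint16_t)((in[i] << 2) | ((in[4] >> (2 * i)) & 0x3));
}

/* Unpack n_pixels (a multiple of 4) from packed RAW10 into 16-bit words. */
void raw10_to_u16(const uint8_t *src, uint16_t *dst, size_t n_pixels)
{
    assert(n_pixels % 4 == 0);
    size_t i = 0;

#if defined(__ARM_NEON)
    /* The lookup table is src[0..7] followed by src[2..9], so src[8] and
     * src[9] sit at table positions 14 and 15. */
    static const uint8_t msb_idx[8] = {0, 1, 2, 3, 5, 6, 7, 14};
    static const uint8_t lsb_idx[8] = {4, 4, 4, 4, 15, 15, 15, 15};
    static const int8_t  lsb_shr[8] = {0, -2, -4, -6, 0, -2, -4, -6};
    const uint8x8_t v_msb_idx = vld1_u8(msb_idx);
    const uint8x8_t v_lsb_idx = vld1_u8(lsb_idx);
    const int8x8_t  v_lsb_shr = vld1_s8(lsb_shr);

    for (; i + 8 <= n_pixels; i += 8, src += 10, dst += 8) {
        uint8x8x2_t tbl;
        tbl.val[0] = vld1_u8(src);      /* bytes 0..7 */
        tbl.val[1] = vld1_u8(src + 2);  /* bytes 2..9 */

        uint8x8_t msb = vtbl2_u8(tbl, v_msb_idx); /* upper 8 bits per pixel */
        uint8x8_t lsb = vtbl2_u8(tbl, v_lsb_idx); /* copies of the LSB byte */
        lsb = vshl_u8(lsb, v_lsb_shr);  /* negative count = right shift */

        /* (msb << 2) inserted above the two low bits kept from lsb;
         * the unwanted upper bits of lsb are overwritten by the insert. */
        uint16x8_t px = vsliq_n_u16(vmovl_u8(lsb), vmovl_u8(msb), 2);
        vst1q_u16(dst, px);
    }
#endif

    /* Remaining groups (or everything, when built without Neon). */
    for (; i < n_pixels; i += 4, src += 5, dst += 4)
        raw10_group_scalar(src, dst);
}
```

Because the table lookup already puts the pixels in order, a plain vst1q_u16 is enough here. The output holds values 0..1023 in the low bits of each word; whether Y16 should instead carry them scaled to the full range (a further shift left by 6) is left open.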


Useful info on ARM:
  * [[https://developer.arm.com/documentation/101964/0300/libTIFF-optimization--CMYK-to-RGBA-conversion|libTIFF optimization: CMYK to RGBA conversion]]
  * [[https://developer.arm.com/documentation/101964/0300/Chromium-optimization--pre-multiplied-alpha-channel-data|Chromium optimization: pre-multiplied alpha channel data]]
  * [[https://developer.arm.com/documentation/102159/0400/Shifting-left-and-right?lang=en|Shifting left and right]]
  * [[https://developer.arm.com/documentation/102159/0400/Load-and-store---example-RGB-conversion?lang=en|Load and store - example RGB conversion]]
  * [[https://developer.arm.com/documentation/102107a/0100/RGB-to-grayscale-conversion|RGB to grayscale conversion]]
  * [[https://developer.arm.com/documentation/den0018/a/NEON-Code-Examples-with-Optimization/Converting-color-depth/Converting-from-RGB565-to-RGB888|Converting from RGB565 to RGB888]]
  * [[https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/coding-for-neon---part-4-shifting-left-and-right|Coding for Neon - Part 4: Shifting Left and Right]]


Useful info:
  * [[https://stackoverflow.com/questions/57363580/neon-unpacking-int8x16-t-into-a-pair-of-int16x8-packing-a-pair-of-int16x8-t-i|Unpacking int8x16_t into a pair of int16x8 & packing a pair of int16x8_t into a int8x16_t]]
  * [[https://stackoverflow.com/questions/44353277/how-to-code-ai-bci-on-arm-neon-simd-intrinsic-function|How to code "a[i]=b[c[i]]" on ARM NEON SIMD Intrinsic function]] (lookup table load)
  * [[https://github.com/yszheda/rgb2yuv-neon/blob/master/yuv444.cpp#L105C21-L105C30|rgb2yuv-neon]]
  * [[http://0x80.pl/notesen/2017-01-07-base64-simd-neon.html|ARM Neon and Base64 encoding & decoding]]
  * [[https://lemire.me/blog/2017/07/10/pruning-spaces-faster-on-arm-processors-with-vector-table-lookups/|Pruning spaces faster on ARM processors with Vector Table Lookups]]
  * [[https://en.eeworld.com.cn/news/mcu/eic309253.html|ARM processor NEON programming and optimization techniques - shift operations such as left shift and right shift]]

From [[https://stackoverflow.com/questions/71554911/how-to-vectorize-2d-array-using-neon-intrinsics|How to vectorize 2D array using Neon intrinsics]]:

  Consider using OpenMP, add #pragma omp parallel for before for loop and -fopenmp to the compiler cmdline
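
A small sketch of what that suggestion looks like in practice. The workload (scaling 10-bit values to the full 16-bit range, row by row) and the function name scale_rows are hypothetical, chosen only to have a loop to annotate. Note that "parallel for" spreads iterations across threads; it does not by itself vectorise the loop body (OpenMP has a separate "simd" directive for that).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build with: gcc -O2 -fopenmp file.c
 * Without -fopenmp the pragma is skipped and the loop runs on one thread. */
void scale_rows(uint16_t *img, int width, int height)
{
    assert(img != NULL && width > 0 && height > 0);

#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int y = 0; y < height; y++) {
        /* Each row is independent, so rows can be handed to different threads. */
        uint16_t *row = img + (size_t)y * (size_t)width;
        for (int x = 0; x < width; x++)
            row[x] = (uint16_t)(row[x] << 6);
    }
}
```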

Auto vectorization in GCC:
  * [[https://developers.redhat.com/articles/2023/12/08/vectorization-optimization-gcc#|Vectorization optimization in GCC]]
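
As an illustration, a loop of the following shape is a typical candidate for the auto-vectoriser. It is a sketch only: the function name widen_shift2 is hypothetical, and whether a given GCC version actually vectorises it depends on the target and flags. The restrict qualifiers tell the compiler the buffers do not overlap, which removes one common obstacle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build with: gcc -O3 -fopt-info-vec file.c
 * -O3 enables the vectoriser; -fopt-info-vec reports which loops it handled. */
void widen_shift2(const uint8_t *restrict src, uint16_t *restrict dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    /* Widen each byte to 16 bits and shift left by 2 (the "extend" step). */
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint16_t)(src[i] << 2);
}
```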

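Instead of using ARM Neon intrinsics directly, consider the OpenCV SIMD wrapper (universal intrinsics), which is portable across platforms:
  * [[https://answers.opencv.org/question/224485/simd-optimizations-get-no-performance-gains-on-arm-neon/|SIMD optimizations get no performance gains on ARM NEON]]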