arm: Add NEON implementations of splat_mv

Relative speedup over C code, for arm64:

                    Cortex A53    A72    A73  Apple M1
splat_mv_w1_neon:         1.01   0.91   1.17         -
splat_mv_w2_neon:         1.65   2.01   1.45         -
splat_mv_w4_neon:         2.55   2.10   1.82         -
splat_mv_w8_neon:         3.43   2.09   2.57     12.00
splat_mv_w16_neon:        3.92   1.73   2.44      3.38
splat_mv_w32_neon:        4.01   1.60   2.28      2.89

(The resolution of the timer used on Apple M1 isn't fine enough to measure the smaller versions of this function, hence the missing entries.)

Relative speedup over C code, for arm32:

                    Cortex A7     A8     A9    A53    A72    A73
splat_mv_w1_neon:        0.69   1.05   0.88   0.62   1.06   1.05
splat_mv_w2_neon:        0.93   2.02   1.95   0.92   2.63   1.41
splat_mv_w4_neon:        1.23   1.96   1.43   1.44   2.07   1.83
splat_mv_w8_neon:        1.70   2.46   1.10   2.76   2.11   2.54
splat_mv_w16_neon:       1.93   2.43   1.11   3.19   1.80   2.64
splat_mv_w32_neon:       1.65   2.26   1.18   3.53   1.77   2.66

@janne Is there anything else you want to try tuning-wise for the smaller sizes (where the current implementation ends up a little slower than the C code)?
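
For context on what the sizes above mean, here is a minimal scalar sketch of the kind of operation being benchmarked. It is not the project's C code: the type `mv_block`, its 12-byte size, the name `splat_mv_sketch`, the parameter names, and the reading of the `wN` suffix as the number of entries written per row are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the motion-vector block that gets splatted.
 * The 12-byte size is an assumption made for this sketch. */
typedef struct {
    uint8_t bytes[12];
} mv_block;

/* Scalar sketch: copy one block into bw4 consecutive entries, starting at
 * column bx4, in each of the bh4 rows pointed to by rows[].
 * All names here are illustrative, not taken from the source tree. */
static void splat_mv_sketch(mv_block **rows, const mv_block *src,
                            int bx4, int bw4, int bh4)
{
    assert(bw4 > 0 && bh4 > 0);
    for (int y = 0; y < bh4; y++) {
        mv_block *dst = rows[y] + bx4;   /* first entry to fill in this row */
        for (int x = 0; x < bw4; x++)
            dst[x] = *src;               /* plain struct copy per entry */
    }
}
```

Under these assumptions the w1 case stores only 12 bytes per row, so loop and dispatch overhead may dominate there, which would be consistent with the smallest sizes showing little or no gain over C.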