arm64: filmgrain16: Add NEON implementation of gen_grain for 16 bpc

Relative speedup over C code:
                                  Cortex A53    A72    A73  Apple M1
gen_grain_uv_ar0_16bpc_420_neon:        2.90   4.13   5.43      5.80
gen_grain_uv_ar0_16bpc_422_neon:        3.23   4.51   5.52      5.83
gen_grain_uv_ar0_16bpc_444_neon:        4.01   4.97   6.08      5.87
gen_grain_uv_ar1_16bpc_420_neon:        2.94   2.80   3.56      3.48
gen_grain_uv_ar1_16bpc_422_neon:        3.14   3.07   3.68      3.47
gen_grain_uv_ar1_16bpc_444_neon:        3.54   3.51   3.93      2.61
gen_grain_uv_ar2_16bpc_420_neon:        3.92   3.69   4.40      3.98
gen_grain_uv_ar2_16bpc_422_neon:        4.13   3.96   4.42      3.92
gen_grain_uv_ar2_16bpc_444_neon:        4.69   4.33   4.84      3.25
gen_grain_uv_ar3_16bpc_420_neon:        5.05   5.39   5.42      4.74
gen_grain_uv_ar3_16bpc_422_neon:        5.25   5.68   5.57      4.67
gen_grain_uv_ar3_16bpc_444_neon:        6.02   6.33   6.35      4.38
gen_grain_y_ar0_16bpc_neon:             4.67   5.23   5.22     10.11
gen_grain_y_ar1_16bpc_neon:             3.32   3.03   3.28      2.24
gen_grain_y_ar2_16bpc_neon:             4.59   3.95   4.64      3.52
gen_grain_y_ar3_16bpc_neon:             5.89   5.93   6.36      4.79

Absolute numbers:
                                  Cortex A53       A72       A73  Apple M1
gen_grain_uv_ar0_16bpc_420_neon:     19797.2    9725.0    9234.0      29.7
gen_grain_uv_ar0_16bpc_422_neon:     34899.4   16875.3   17021.6      57.7
gen_grain_uv_ar0_16bpc_444_neon:     53776.6   28470.1   28773.1     107.8
gen_grain_uv_ar1_16bpc_420_neon:     37998.2   24631.2   24754.0      84.2
gen_grain_uv_ar1_16bpc_422_neon:     70817.5   44642.5   46323.1     166.3
gen_grain_uv_ar1_16bpc_444_neon:    123333.0   77316.4   83523.1     427.5
gen_grain_uv_ar2_16bpc_420_neon:     49115.8   33053.7   33249.9      93.6
gen_grain_uv_ar2_16bpc_422_neon:     92965.3   59663.8   64741.9     187.9
gen_grain_uv_ar2_16bpc_444_neon:    160899.7  108845.6  115422.4     441.8
gen_grain_uv_ar3_16bpc_420_neon:     65786.6   41924.3   45562.1     108.1
gen_grain_uv_ar3_16bpc_422_neon:    126232.3   78691.6   87351.5     217.6
gen_grain_uv_ar3_16bpc_444_neon:    218702.6  140197.8  151294.8     454.3
gen_grain_y_ar0_16bpc_neon:          35867.9   17653.6   20770.7     108.0
gen_grain_y_ar1_16bpc_neon:         118781.8   74777.1   81338.6     426.0
gen_grain_y_ar2_16bpc_neon:         155919.9  102145.8  109698.1     438.5
gen_grain_y_ar3_16bpc_neon:         213348.1  133054.8  144726.0     447.9

Corresponding numbers for 8bpc:
                                  Cortex A53       A72       A73  Apple M1
gen_grain_uv_ar0_8bpc_420_neon:      15086.1    8384.7    8556.6      29.4
gen_grain_uv_ar0_8bpc_422_neon:      26800.6   14354.4   15526.5      56.6
gen_grain_uv_ar0_8bpc_444_neon:      43749.6   22408.6   24627.9     108.3
gen_grain_uv_ar1_8bpc_420_neon:      33706.3   21892.6   22835.9      87.1
gen_grain_uv_ar1_8bpc_422_neon:      63897.0   41820.1   43468.9     171.8
gen_grain_uv_ar1_8bpc_444_neon:     117345.1   76372.5   79938.3     370.0
gen_grain_uv_ar2_8bpc_420_neon:      42808.8   28493.8   29932.8      92.2
gen_grain_uv_ar2_8bpc_422_neon:      82282.5   53969.4   58191.1     181.8
gen_grain_uv_ar2_8bpc_444_neon:     147641.4   98136.4  103157.6     430.2
gen_grain_uv_ar3_8bpc_420_neon:      56784.3   36342.0   40812.3     102.2
gen_grain_uv_ar3_8bpc_422_neon:     110249.7   70215.6   79716.0     200.5
gen_grain_uv_ar3_8bpc_444_neon:     196461.7  125802.8  141781.5     440.1
gen_grain_y_ar0_8bpc_neon:           36451.7   17794.4   19839.3     109.5
gen_grain_y_ar1_8bpc_neon:          113155.6   71811.9   77296.8     370.2
gen_grain_y_ar2_8bpc_neon:          142812.3   95042.4  100434.4     431.8
gen_grain_y_ar3_8bpc_neon:          191608.6  121199.5  136946.4     437.2

Real-world decoding of chimera 10 bpc seems to speed up from around
281 to 283 fps on an Apple M1.
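
As a rough check of what the fps figures above amount to, the relative
gain can be worked out as below. This is only a sketch of the arithmetic:
the helper name relative_gain is hypothetical and not part of dav1d, and
the inputs are the approximate figures quoted above.

```python
def relative_gain(before_fps: float, after_fps: float) -> float:
    """Return the fractional throughput gain going from before_fps to after_fps."""
    if before_fps <= 0:
        raise ValueError("before_fps must be positive")
    return (after_fps - before_fps) / before_fps


# Approximate figures quoted above for chimera 10 bpc on an Apple M1.
gain = relative_gain(281.0, 283.0)
print(f"{gain:.2%}")  # prints "0.71%"
```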