arm32: filmgrain: Add NEON implementation of gen_grain for 8 bpc

Relative speedup over C code:
                         Cortex     A7     A8     A9    A53    A72    A73
gen_grain_uv_ar0_8bpc_420_neon:   6.13   7.81   8.17   6.78   6.62  11.13
gen_grain_uv_ar0_8bpc_422_neon:   6.34   7.64   8.00   6.83   6.93  10.31
gen_grain_uv_ar0_8bpc_444_neon:   7.09   8.29   8.55   7.95   7.89  11.05
gen_grain_uv_ar1_8bpc_420_neon:   3.39   2.26   3.06   4.13   3.41   4.95
gen_grain_uv_ar1_8bpc_422_neon:   3.40   2.23   3.02   4.18   3.36   4.73
gen_grain_uv_ar1_8bpc_444_neon:   3.46   2.18   2.95   4.46   3.57   4.91
gen_grain_uv_ar2_8bpc_420_neon:   3.88   3.00   3.32   4.74   3.57   5.31
gen_grain_uv_ar2_8bpc_422_neon:   3.92   3.04   3.36   4.82   3.57   5.06
gen_grain_uv_ar2_8bpc_444_neon:   4.32   3.14   3.62   5.56   3.90   5.43
gen_grain_uv_ar3_8bpc_420_neon:   4.35   3.53   4.05   5.35   4.44   5.56
gen_grain_uv_ar3_8bpc_422_neon:   4.38   3.49   4.17   5.41   4.48   5.36
gen_grain_uv_ar3_8bpc_444_neon:   4.84   3.70   4.36   5.95   4.87   5.82
gen_grain_y_ar0_8bpc_neon:        5.18   5.57   7.65   5.93   7.13   9.01
gen_grain_y_ar1_8bpc_neon:        2.64   1.66   2.48   3.32   3.15   3.77
gen_grain_y_ar2_8bpc_neon:        3.57   2.64   3.21   4.59   3.68   4.64
gen_grain_y_ar3_8bpc_neon:        4.27   3.93   4.12   5.41   4.63   5.17

(A73 is benched against C code compiled with a different C compiler, which can explain the slightly differing numbers there.)

Absolute numbers:
                         Cortex        A7        A8        A9       A53       A72       A73
gen_grain_uv_ar0_8bpc_420_neon:   19614.6   13396.4   12320.4   15030.7    8288.1    8754.4
gen_grain_uv_ar0_8bpc_422_neon:   34660.9   24315.5   22225.3   26809.2   14549.8   15804.6
gen_grain_uv_ar0_8bpc_444_neon:   55625.6   39914.5   37100.2   44658.3   22917.3   27369.6
gen_grain_uv_ar1_8bpc_420_neon:   50049.5   63179.4   44793.1   36406.7   22690.3   25401.9
gen_grain_uv_ar1_8bpc_422_neon:   93289.5  117755.0   82815.4   67081.4   43133.1   46698.0
gen_grain_uv_ar1_8bpc_444_neon:  170880.0  223259.2  156241.5  122760.0   78655.6   85604.9
gen_grain_uv_ar2_8bpc_420_neon:   68185.5   78123.2   61457.3   47886.7   31526.2   36519.6
gen_grain_uv_ar2_8bpc_422_neon:  129195.2  148653.9  114133.2   89822.7   60242.6   70160.1
gen_grain_uv_ar2_8bpc_444_neon:  233133.7  272277.4  214108.7  161589.5  109069.3  127763.7
gen_grain_uv_ar3_8bpc_420_neon:   96374.4   94372.2   79663.8   70832.0   43065.3   50593.9
gen_grain_uv_ar3_8bpc_422_neon:  186324.8  184321.8  151490.1  136200.1   83758.0   98378.7
gen_grain_uv_ar3_8bpc_444_neon:  335596.6  336811.6  279755.5  247251.5  151657.2  178906.0
gen_grain_y_ar0_8bpc_neon:        46109.3   36022.2   28476.2   36478.5   18740.1   20660.4
gen_grain_y_ar1_8bpc_neon:       165054.2  217090.4  152578.9  118409.4   74357.2   83794.5
gen_grain_y_ar2_8bpc_neon:       226576.9  268320.3  210924.6  157829.4  105956.5  124293.2
gen_grain_y_ar3_8bpc_neon:       328337.2  330421.3  275110.1  242097.3  148538.7  177270.8

Corresponding numbers for the original arm64 version:
                         Cortex       A53       A72       A73
gen_grain_uv_ar0_8bpc_420_neon:   14874.7    7765.5    8536.0
gen_grain_uv_ar0_8bpc_422_neon:   26510.9   13685.3   15308.2
gen_grain_uv_ar0_8bpc_444_neon:   43189.6   21565.3   24312.0
gen_grain_uv_ar1_8bpc_420_neon:   33715.7   21669.8   22758.3
gen_grain_uv_ar1_8bpc_422_neon:   63955.3   41581.4   42852.5
gen_grain_uv_ar1_8bpc_444_neon:  117390.1   76503.5   78446.4
gen_grain_uv_ar2_8bpc_420_neon:   42779.0   27794.3   29677.9
gen_grain_uv_ar2_8bpc_422_neon:   82283.8   53446.7   58232.2
gen_grain_uv_ar2_8bpc_444_neon:  147773.8   98492.7  103754.1
gen_grain_uv_ar3_8bpc_420_neon:   56698.8   35697.1   40695.9
gen_grain_uv_ar3_8bpc_422_neon:  110132.4   69829.1   79196.8
gen_grain_uv_ar3_8bpc_444_neon:  196642.7  124174.9  141812.5
gen_grain_y_ar0_8bpc_neon:        36461.0   17782.0   19827.0
gen_grain_y_ar1_8bpc_neon:       113202.7   72457.7   75995.8
gen_grain_y_ar2_8bpc_neon:       142894.0   94450.9  100304.5
gen_grain_y_ar3_8bpc_neon:       191697.7  120674.9  137223.8

The arm64 version uses lots of registers (21 different GPRs in total, 18 of them in the hot loop), so making it work on arm32, which has far fewer registers available, incurs some overhead.
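
To put a rough number on that overhead, the A53 columns of the two absolute tables can be divided row by row. The sketch below is only illustrative: the three rows are picked arbitrarily, the names `ARM32_A53`, `ARM64_A53` and `slowdown` are made up for this example, and it assumes the absolute numbers are runtimes where lower is faster.

```python
# Values copied from the Cortex-A53 columns of the arm32 and arm64
# absolute tables above. Assumption: these are runtimes (lower is faster).
ARM32_A53 = {
    "gen_grain_uv_ar0_8bpc_420_neon": 15030.7,
    "gen_grain_y_ar1_8bpc_neon": 118409.4,
    "gen_grain_y_ar3_8bpc_neon": 242097.3,
}
ARM64_A53 = {
    "gen_grain_uv_ar0_8bpc_420_neon": 14874.7,
    "gen_grain_y_ar1_8bpc_neon": 113202.7,
    "gen_grain_y_ar3_8bpc_neon": 191697.7,
}


def slowdown(arm32: float, arm64: float) -> float:
    """Return how many times slower the arm32 run is than the arm64 run."""
    return arm32 / arm64


# Ratio per function; a value of 1.0 would mean no overhead at all.
ratios = {name: slowdown(ARM32_A53[name], ARM64_A53[name]) for name in ARM32_A53}

for name, ratio in ratios.items():
    print(f"{name}: arm32/arm64 = {ratio:.3f}")
```

For these rows the ar0 function lands within a few percent of the arm64 version on the same core, while ar3 is noticeably slower, which is consistent with the register pressure described above.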