arm32: filmgrain: Add NEON implementation of gen_grain for 16 bpc

Relative speedup over C code:
                              Cortex A7    A8    A9   A53   A72   A73
gen_grain_uv_ar0_16bpc_420_neon:   5.05  6.71  5.42  4.95  6.45  9.59
gen_grain_uv_ar0_16bpc_422_neon:   5.54  7.18  6.29  5.45  6.55  8.80
gen_grain_uv_ar0_16bpc_444_neon:   6.64  8.07  6.70  6.89  7.16  9.98
gen_grain_uv_ar1_16bpc_420_neon:   3.22  2.16  2.58  3.51  3.16  4.68
gen_grain_uv_ar1_16bpc_422_neon:   3.24  2.26  2.73  3.83  3.36  4.65
gen_grain_uv_ar1_16bpc_444_neon:   3.48  2.41  2.85  4.32  3.69  4.90
gen_grain_uv_ar2_16bpc_420_neon:   3.29  2.90  2.92  4.14  3.48  4.59
gen_grain_uv_ar2_16bpc_422_neon:   3.35  3.01  3.13  4.50  3.61  4.50
gen_grain_uv_ar2_16bpc_444_neon:   3.66  3.55  3.32  5.15  3.87  4.93
gen_grain_uv_ar3_16bpc_420_neon:   3.39  3.79  3.60  4.67  4.04  4.70
gen_grain_uv_ar3_16bpc_422_neon:   3.39  4.04  3.96  4.93  4.16  4.65
gen_grain_uv_ar3_16bpc_444_neon:   3.79  4.47  4.36  5.54  4.59  5.07
gen_grain_y_ar0_16bpc_neon:        5.05  5.26  6.97  5.47  5.95  8.59
gen_grain_y_ar1_16bpc_neon:        2.35  1.72  2.07  3.53  3.16  3.47
gen_grain_y_ar2_16bpc_neon:        3.02  2.70  2.88  4.19  3.57  4.03
gen_grain_y_ar3_16bpc_neon:        3.49  3.18  3.69  5.01  3.99  4.50