arm32: mc: NEON implementation of avg/mask/w_avg for 16 bpc

                         Cortex A7       A8       A9      A53      A72      A73
avg_w4_16bpc_neon:           131.4     81.8    117.3    111.0     50.9     58.8
avg_w8_16bpc_neon:           291.9    173.1    293.1    230.9    114.7    128.8
avg_w16_16bpc_neon:          803.3    480.1    821.4    645.8    345.7    384.9
avg_w32_16bpc_neon:         3350.0   1833.1   3188.1   2343.5   1343.9   1500.6
avg_w64_16bpc_neon:         8185.9   4390.6  10448.2   6078.8   3303.6   3466.7
avg_w128_16bpc_neon:       22384.3  10901.2  33721.9  16782.7   8165.1   8416.5
w_avg_w4_16bpc_neon:         251.3    165.8    203.9    158.3     99.6    106.9
w_avg_w8_16bpc_neon:         638.4    427.8    555.7    365.1    283.2    277.4
w_avg_w16_16bpc_neon:       1912.3   1257.5   1623.4   1056.5    879.5    841.8
w_avg_w32_16bpc_neon:       7461.3   4889.6   6383.8   3966.3   3286.8   3296.8
w_avg_w64_16bpc_neon:      18689.3  11698.1  18487.3  10134.1   8156.2   7939.5
w_avg_w128_16bpc_neon:     48776.6  28989.0  53203.3  26004.1  20055.2  20049.4
mask_w4_16bpc_neon:          298.6    189.2    242.3    191.6    115.2    129.6
mask_w8_16bpc_neon:          768.6    501.5    646.1    432.4    302.9    326.8
mask_w16_16bpc_neon:        2320.5   1480.9   1873.0   1270.2    932.2    976.1
mask_w32_16bpc_neon:        9412.0   5791.9   7348.5   4875.1   3896.4   3821.1
mask_w64_16bpc_neon:       23385.9  13875.6  21383.8  12235.9   9469.2   9160.2
mask_w128_16bpc_neon:      60466.4  34762.6  61055.9  31214.0  23299.0  23324.5

For comparison, the corresponding numbers for the existing arm64
implementation:

                        Cortex A53      A72      A73
avg_w4_16bpc_neon:            78.0     38.5     50.0
avg_w8_16bpc_neon:           198.3    105.4    117.8
avg_w16_16bpc_neon:          614.9    339.9    376.7
avg_w32_16bpc_neon:         2313.8   1391.1   1487.7
avg_w64_16bpc_neon:         5733.3   3269.1   3648.4
avg_w128_16bpc_neon:       15105.9   8143.5   8970.4
w_avg_w4_16bpc_neon:         119.2     87.7     92.9
w_avg_w8_16bpc_neon:         322.9    252.3    263.5
w_avg_w16_16bpc_neon:       1016.8    794.0    828.6
w_avg_w32_16bpc_neon:       3910.9   3159.6   3308.3
w_avg_w64_16bpc_neon:       9499.6   7933.9   8026.5
w_avg_w128_16bpc_neon:     24508.3  19502.0  20389.8
mask_w4_16bpc_neon:          138.9     98.7    106.7
mask_w8_16bpc_neon:          375.5    301.1    302.7
mask_w16_16bpc_neon:        1217.2   1064.6    954.4
mask_w32_16bpc_neon:        4821.0   4018.4   3825.7
mask_w64_16bpc_neon:       12262.7   9471.3   9169.7
mask_w128_16bpc_neon:      31356.6  22657.6  23324.5