arm32: mc: Optimize warp by doing horz filtering in 8 bit
Additionally, reschedule the load instructions to reduce stalls on in-order cores.
This applies the changes from a3b8157e to the arm32 version.
Before:                Cortex A7      A8      A9     A53     A72     A73
warp_8x8_8bpc_neon:       3659.3  1746.0  1931.9  2128.8  1173.7  1188.9
warp_8x8t_8bpc_neon:      3650.8  1724.6  1919.8  2105.0  1147.7  1206.9
warp_8x8_16bpc_neon:      4039.4  2111.9  2337.1  2462.5  1334.6  1396.5
warp_8x8t_16bpc_neon:     3973.9  2137.1  2299.6  2413.2  1282.8  1369.6

After:
warp_8x8_8bpc_neon:       2920.8  1269.8  1410.3  1767.3   860.2  1004.8
warp_8x8t_8bpc_neon:      2904.9  1283.9  1397.5  1743.7   863.6  1024.7
warp_8x8_16bpc_neon:      3895.5  2060.7  2339.8  2376.6  1331.1  1394.0
warp_8x8t_16bpc_neon:     3822.7  2026.7  2298.7  2325.4  1278.1  1360.8
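
For readers who want the relative gains rather than raw timings, the Before/After figures can be turned into per-core speedup ratios. The snippet below is only a convenience sketch, not part of the change itself: the names CORES, BEFORE, AFTER and speedups() are made up for illustration, and it assumes that lower numbers mean faster code (the message does not state the unit).

```python
# Timings copied from the Before/After figures in this message.
# Assumption: lower is faster; the unit is not stated in the message.
CORES = ["A7", "A8", "A9", "A53", "A72", "A73"]

BEFORE = {
    "warp_8x8_8bpc_neon":   [3659.3, 1746.0, 1931.9, 2128.8, 1173.7, 1188.9],
    "warp_8x8t_8bpc_neon":  [3650.8, 1724.6, 1919.8, 2105.0, 1147.7, 1206.9],
    "warp_8x8_16bpc_neon":  [4039.4, 2111.9, 2337.1, 2462.5, 1334.6, 1396.5],
    "warp_8x8t_16bpc_neon": [3973.9, 2137.1, 2299.6, 2413.2, 1282.8, 1369.6],
}

AFTER = {
    "warp_8x8_8bpc_neon":   [2920.8, 1269.8, 1410.3, 1767.3, 860.2, 1004.8],
    "warp_8x8t_8bpc_neon":  [2904.9, 1283.9, 1397.5, 1743.7, 863.6, 1024.7],
    "warp_8x8_16bpc_neon":  [3895.5, 2060.7, 2339.8, 2376.6, 1331.1, 1394.0],
    "warp_8x8t_16bpc_neon": [3822.7, 2026.7, 2298.7, 2325.4, 1278.1, 1360.8],
}


def speedups(before, after):
    """Return before/after ratios per function; values above 1.0 are gains."""
    return {
        name: [b / a for b, a in zip(before[name], after[name])]
        for name in before
    }


if __name__ == "__main__":
    print(f"{'':<22}" + "".join(f"{core:>8}" for core in CORES))
    for name, ratios in speedups(BEFORE, AFTER).items():
        print(f"{name + ':':<22}" + "".join(f"{r:>7.2f}x" for r in ratios))
```

A ratio close to 1.0, as in most of the 16bpc columns, means the timing is essentially unchanged on that core.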
Edited by Martin Storsjö