arm64: mc: Schedule instructions better in the warp8x8 functions

Before:               Cortex A53     A72     A73
warp_8x8_8bpc_neon:       1997.3  1170.1  1199.9
warp_8x8t_8bpc_neon:      1982.4  1171.5  1192.6
After:
warp_8x8_8bpc_neon:       1954.6  1159.2  1153.3
warp_8x8t_8bpc_neon:      1938.5  1146.2  1136.7