arm64: mc: NEON implementation of warp for 16 bpc

Checkasm benchmark numbers:
                       Cortex A53     A72     A73
warp_8x8_16bpc_neon:       2029.9  1150.5  1225.2
warp_8x8t_16bpc_neon:      2007.6  1129.0  1192.3

Corresponding numbers for 8bpc for comparison:
warp_8x8_8bpc_neon:        1863.8  1052.8  1106.2
warp_8x8t_8bpc_neon:       1847.4  1048.3  1099.8