arm64: ipred: 8 bpc NEON implementation of the Z2 function

Relative speedup over C code:
                              Cortex A53    A55    A72    A73    A76  Apple M1
intra_pred_z2_w4_8bpc_neon:         3.91   3.55   3.31   3.94   3.46      8.50
intra_pred_z2_w8_8bpc_neon:         5.68   5.67   4.31   5.31   4.34      5.83
intra_pred_z2_w16_8bpc_neon:        8.39   9.28   5.53   7.04   7.01      9.45
intra_pred_z2_w32_8bpc_neon:        7.01   8.01   5.04   6.32   5.48      7.48
intra_pred_z2_w64_8bpc_neon:        8.73  10.25   5.92   7.61   6.63     10.05