arm64: ipred: 8 bpc NEON implementations of the Z1 and Z3 functions
The Z3 implementation is a hybrid of two approaches. For cases with large max_base_y, a generic (but non-ideal) approach fills two pixel columns at a time, i.e. it loops over pixels vertically first and then horizontally, which is a suboptimal access order for the destination.
For cases with smaller max_base_y, it does two rows at a time, essentially doing gathers with the TBX instruction.
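As a rough illustration of why TBX works as a gather, here is a scalar C model of its per-byte behaviour. This is only a sketch and not the code in this commit: tbx_model and its parameters are hypothetical names, and the model accepts any table length up to 64 bytes, whereas the real instruction takes one to four 16-byte table registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the AArch64 TBX byte lookup.
 * For each lane i: if idx[i] addresses a byte inside the table, that byte
 * is copied to dst[i]; otherwise dst[i] keeps its previous value.
 * (TBL differs only in that out-of-range lanes are set to 0.) */
static void tbx_model(uint8_t *dst, const uint8_t *table, size_t table_len,
                      const uint8_t *idx, size_t n)
{
    /* TBX takes at most four 16-byte table registers. */
    assert(table_len <= 64);
    for (size_t i = 0; i < n; i++) {
        if (idx[i] < table_len)
            dst[i] = table[idx[i]];
        /* else: lane is left untouched */
    }
}
```

Since out-of-range lanes are left alone, a destination that is pre-filled with a default value (for example a padding pixel, as an assumption about how this could be used) keeps that value wherever the index points past the end of the loaded edge, so no separate clamping of indices is needed.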
Relative speedup over the C code:
                              Cortex A53    A55    A72    A73    A76   Apple M1
intra_pred_z1_w4_8bpc_neon:         4.09   3.15   3.63   4.16   3.27      13.00
intra_pred_z1_w8_8bpc_neon:         6.93   5.66   5.57   6.76   5.51       5.50
intra_pred_z1_w16_8bpc_neon:        7.81   6.85   6.24   7.78   6.59       9.00
intra_pred_z1_w32_8bpc_neon:       10.56   9.95   8.72  10.95   8.28      13.33
intra_pred_z1_w64_8bpc_neon:       11.00  11.38   9.11  11.62   8.65      14.61
intra_pred_z3_w4_8bpc_neon:         3.32   2.89   2.78   3.52   2.52       9.67
intra_pred_z3_w8_8bpc_neon:         6.24   5.55   4.76   5.60   4.11       6.40
intra_pred_z3_w16_8bpc_neon:        7.64   7.07   4.37   6.23   4.18       8.60
intra_pred_z3_w32_8bpc_neon:        7.51   7.21   4.34   5.92   4.27       7.88
intra_pred_z3_w64_8bpc_neon:        6.82   6.25   4.08   5.83   3.52       7.31
(The speedup numbers for M1 are somewhat noisy due to the very coarse granularity of the timer used there.)