AArch64: Optimize vertical i8mm subpel filters
Replace the accumulator initializations of the vertical subpel filters with zero fills of the registers, which are usually zero-latency operations in this feature class. As a consequence, the prep cases now use rounding shifts at the end. Out-of-order CPU cores can benefit from this change.
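The arithmetic equivalence behind this change can be sketched in scalar C. This is only an illustration, not the actual assembly: the function names, the tap values passed in, and the shift amount (SHIFT = 2) are assumptions chosen for the example. It shows that seeding the accumulator with the rounding bias and shifting plainly gives the same result as starting from zero and applying a rounding shift at the end.

```c
#include <stdint.h>

/* Hypothetical shift amount, chosen only for illustration. */
enum { SHIFT = 2 };

/* Scheme before the change (as assumed here): the accumulator starts
 * at the rounding bias, so a plain shift at the end is sufficient. */
static int32_t filter_bias_init(const uint8_t *src, const int8_t *coef, int taps)
{
    int32_t acc = 1 << (SHIFT - 1);
    for (int i = 0; i < taps; i++)
        acc += (int32_t)src[i] * coef[i];
    /* Assumes an arithmetic right shift for negative values, which is
     * what common compilers provide. */
    return acc >> SHIFT;
}

/* Scalar model of a rounding right shift: add half of the divisor
 * before shifting. */
static int32_t rounding_shift_right(int32_t v, int shift)
{
    return (v + (1 << (shift - 1))) >> shift;
}

/* Scheme after the change: the accumulator starts at zero and the
 * rounding is folded into the final shift. */
static int32_t filter_zero_init(const uint8_t *src, const int8_t *coef, int taps)
{
    int32_t acc = 0;
    for (int i = 0; i < taps; i++)
        acc += (int32_t)src[i] * coef[i];
    return rounding_shift_right(acc, SHIFT);
}
```

Both functions compute the same value for any input. The zero-initialized variant simply moves the rounding from the start of the dependency chain to its end.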
Relative performance of micro benchmarks (lower is better):
Cortex-X3:
mct_8tap_sharp_w16_v_8bpc_i8mm: 0.910x
mct_8tap_sharp_w8_v_8bpc_i8mm: 0.986x
mc_8tap_sharp_w16_v_8bpc_i8mm: 0.864x
mc_8tap_sharp_w8_v_8bpc_i8mm: 0.882x
mc_8tap_sharp_w4_v_8bpc_i8mm: 0.933x
mc_8tap_sharp_w2_v_8bpc_i8mm: 0.926x
Cortex-A715:
mct_8tap_sharp_w16_v_8bpc_i8mm: 0.855x
mct_8tap_sharp_w8_v_8bpc_i8mm: 0.784x
mct_8tap_sharp_w4_v_8bpc_i8mm: 1.069x
mc_8tap_sharp_w16_v_8bpc_i8mm: 0.850x
mc_8tap_sharp_w8_v_8bpc_i8mm: 0.779x
mc_8tap_sharp_w4_v_8bpc_i8mm: 0.971x
mc_8tap_sharp_w2_v_8bpc_i8mm: 0.975x
Cortex-A510:
mct_8tap_sharp_w16_v_8bpc_i8mm: 1.001x
mct_8tap_sharp_w8_v_8bpc_i8mm: 0.979x
mct_8tap_sharp_w4_v_8bpc_i8mm: 0.998x
mc_8tap_sharp_w16_v_8bpc_i8mm: 0.998x
mc_8tap_sharp_w8_v_8bpc_i8mm: 1.004x
mc_8tap_sharp_w4_v_8bpc_i8mm: 1.003x
mc_8tap_sharp_w2_v_8bpc_i8mm: 0.996x