AArch64: Optimize horizontal i8mm prep filters

Replace the accumulator initializations of the horizontal prep
filters with register fills by zeros. Most i8mm-capable CPUs can do
these fills with zero latency, but in exchange we need to use
rounding shifts at the end of the filter. This change improves
performance on out-of-order CPUs.
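
For illustration only (this is not the actual assembly), the trade-off
can be sketched in C, assuming the previous initialization held a
rounding constant. The function names, the plain 8-tap loop and the
shift amount of 2 are hypothetical and chosen just for the example:

```c
#include <stdint.h>

/* Hypothetical shift amount, chosen only for this illustration. */
enum { SHIFT = 2 };

/* Before (assumed): the accumulator starts at the rounding constant,
 * so a plain (truncating) right shift at the end rounds correctly. */
static int32_t filter_bias_init(const uint8_t *src, const int8_t *coef)
{
    int32_t acc = 1 << (SHIFT - 1);
    for (int i = 0; i < 8; i++)
        acc += src[i] * coef[i];
    /* Assumes an arithmetic right shift for negative values, as on
     * all common compilers. */
    return acc >> SHIFT;
}

/* Scalar model of a rounding right shift. */
static int32_t rounding_shift(int32_t x, int n)
{
    return (x + (1 << (n - 1))) >> n;
}

/* After: the accumulator starts at zero (a cheap register fill) and
 * the rounding is folded into the final shift instead. */
static int32_t filter_zero_init(const uint8_t *src, const int8_t *coef)
{
    int32_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc += src[i] * coef[i];
    return rounding_shift(acc, SHIFT);
}
```

Both functions return the same value for any input, which is why the
rounding constant can move out of the accumulator initialization and
into the final shift.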

Relative performance of micro benchmarks (lower is better):

Cortex-X3:
  mct_8tap_sharp_w32_h_8bpc_i8mm: 0.914x
  mct_8tap_sharp_w16_h_8bpc_i8mm: 0.906x
  mct_8tap_sharp_w8_h_8bpc_i8mm:  0.877x

Cortex-A715:
  mct_8tap_sharp_w32_h_8bpc_i8mm: 0.819x
  mct_8tap_sharp_w16_h_8bpc_i8mm: 0.805x
  mct_8tap_sharp_w8_h_8bpc_i8mm:  0.779x

Cortex-A510:
  mct_8tap_sharp_w32_h_8bpc_i8mm: 0.999x
  mct_8tap_sharp_w16_h_8bpc_i8mm: 1.001x
  mct_8tap_sharp_w8_h_8bpc_i8mm:  0.996x
  mct_8tap_sharp_w4_h_8bpc_i8mm:  0.915x