AArch64: Optimize horizontal i8mm prep filters (!1658) · Merge requests · VideoLAN / dav1d

Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_sbd_i8mm_h into master Apr 26, 2024

Replace the accumulator initializations of the horizontal prep filters with zero fills of the registers. Most i8mm-capable CPUs can execute these zero fills with zero latency, but in exchange the filters need rounding shifts at the end. This change improves performance on out-of-order CPUs.
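The trade-off described above can be illustrated with a scalar C model. This is only a sketch, not the NEON code from this merge request: the function names, the 8-tap length, the shift amount `SHIFT` and the helper `asr` are hypothetical, and it assumes the old accumulator initialization carried a rounding constant. It shows that starting from a rounding constant and ending with a truncating shift gives the same result as starting from zero and ending with a rounding shift.

```c
#include <stdint.h>

/* Assumed values for illustration only; not taken from the patch. */
enum { TAPS = 8, SHIFT = 2 };

/* Portable arithmetic (floor) right shift for signed values. */
static int32_t asr(int32_t v, int s)
{
    return v >= 0 ? (v >> s) : ~(~v >> s);
}

/* Before: accumulator starts at the rounding constant, then a plain shift. */
static int32_t filter_bias_init(const uint8_t *src, const int8_t *coef)
{
    int32_t acc = 1 << (SHIFT - 1);
    for (int i = 0; i < TAPS; i++)
        acc += src[i] * coef[i];
    return asr(acc, SHIFT);
}

/* After: accumulator starts at zero, rounding is folded into the final shift. */
static int32_t filter_zero_init(const uint8_t *src, const int8_t *coef)
{
    int32_t acc = 0;
    for (int i = 0; i < TAPS; i++)
        acc += src[i] * coef[i];
    return asr(acc + (1 << (SHIFT - 1)), SHIFT);
}
```

Both functions compute the same value for any input, so the choice between them only affects which instructions are used: a zero fill plus a rounding shift in one case, a constant load plus a plain shift in the other.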

Relative performance of micro benchmarks (lower is better):

Cortex-X3:

mct_8tap_sharp_w32_h_8bpc_i8mm:  0.914x
mct_8tap_sharp_w16_h_8bpc_i8mm:  0.906x
mct_8tap_sharp_w8_h_8bpc_i8mm:   0.877x

Cortex-A715:

mct_8tap_sharp_w32_h_8bpc_i8mm:  0.819x
mct_8tap_sharp_w16_h_8bpc_i8mm:  0.805x
mct_8tap_sharp_w8_h_8bpc_i8mm:   0.779x

Cortex-A510:

mct_8tap_sharp_w32_h_8bpc_i8mm:  0.999x
mct_8tap_sharp_w16_h_8bpc_i8mm:  1.001x
mct_8tap_sharp_w8_h_8bpc_i8mm:   0.996x
mct_8tap_sharp_w4_h_8bpc_i8mm:   0.915x
