AArch64: Optimize Armv8.0 Neon path of HBD horizontal filters

The reduction parts of the horizontal HBD MC filters use SRSHL+SQXTUN+SRSHL
instruction sequences. In the horizontal case this can be rewritten using
a single SQSHRUN instruction with an additional rounding value (34 for
10-bit and 40 for 12-bit).

This patch also includes improved instruction scheduling and saves some EXT
instructions in the 6-tap part using pointer arithmetic.
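
The rounding values quoted above can be checked with a small scalar model.
The sketch below is not code from this patch: the helper names are
hypothetical, and it assumes that the first rounding shift is by
(6 - intermediate_bits) and the second by intermediate_bits, with
intermediate_bits = 14 - bitdepth. Under those assumptions the two rounding
shifts fold into one plain shift by 6 with a bias of
32 + ((1 << (6 - intermediate_bits)) >> 1), which gives 34 for 10-bit and
40 for 12-bit.

```c
#include <stdint.h>

/* Scalar model only; all names here are hypothetical, not from the patch.
 * Assumes '>>' on negative values is an arithmetic shift, which holds on
 * mainstream compilers but is implementation-defined in C. */

static int32_t clamp_i32(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Rounding right shift: per-lane effect of SRSHL with a negative shift. */
static int32_t round_shift(int32_t v, int sh) {
    return (v + ((1 << sh) >> 1)) >> sh;
}

/* Bias that lets a plain shift by 6 replace the two rounding shifts.
 * Assumption: intermediate_bits = 14 - bitdepth. */
static int32_t combined_rnd(int bitdepth) {
    const int ib = 14 - bitdepth;
    return 32 + ((1 << (6 - ib)) >> 1);
}

/* Old shape: rounding shift, saturate to u16 (SQXTUN), rounding shift. */
static int32_t reduce_two_step(int32_t sum, int bitdepth) {
    const int ib = 14 - bitdepth;
    int32_t v = round_shift(sum, 6 - ib);
    v = clamp_i32(v, 0, 65535);
    v = round_shift(v, ib);
    return clamp_i32(v, 0, (1 << bitdepth) - 1);
}

/* New shape: add the bias, then one truncating shift that saturates to
 * u16 (SQSHRUN), followed by the same clamp to the pixel range. */
static int32_t reduce_single(int32_t sum, int bitdepth) {
    int32_t v = (sum + combined_rnd(bitdepth)) >> 6;
    v = clamp_i32(v, 0, 65535);
    return clamp_i32(v, 0, (1 << bitdepth) - 1);
}

/* Returns 1 if both reductions agree for every sum in [lo, hi]. */
static int reductions_agree(int bitdepth, int32_t lo, int32_t hi) {
    for (int32_t s = lo; s <= hi; s++)
        if (reduce_two_step(s, bitdepth) != reduce_single(s, bitdepth))
            return 0;
    return 1;
}
```

Calling reductions_agree(10, lo, hi) and reductions_agree(12, lo, hi) over
the range of plausible filter sums compares the two forms value by value.
How the bias is actually added in the assembly is not modelled here.
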

Relative runtime of micro benchmarks after this patch on some Cortex CPU cores:

             X4      X1    A720     A78     A76     A72    A520     A55
mc regular:
    w2:  0.848x  0.850x  0.948x  0.873x  0.850x  0.876x  0.856x  0.860x
    w4:  0.898x  0.863x  0.995x  0.978x  0.880x  0.807x  1.035x  1.044x
    w8:  0.855x  0.796x  0.933x  0.855x  0.827x  0.753x  0.891x  0.959x
   w16:  0.860x  0.816x  0.954x  0.926x  0.794x  0.722x  0.880x  0.944x
   w32:  0.853x  0.832x  0.958x  0.929x  0.790x  0.733x  0.877x  0.936x
   w64:  0.854x  0.838x  0.956x  0.935x  0.759x  0.744x  0.875x  0.924x
mc sharp:
    w2:  0.843x  0.854x  0.946x  0.872x  0.836x  0.842x  0.855x  0.862x
    w4:  0.892x  0.861x  0.994x  0.979x  0.875x  0.797x  1.035x  1.044x
    w8:  0.905x  0.881x  1.013x  0.921x  0.881x  0.847x  0.945x  0.976x
   w16:  0.924x  0.892x  1.000x  0.991x  0.860x  0.823x  0.949x  0.975x
   w32:  0.937x  0.896x  0.988x  1.005x  0.832x  0.813x  0.952x  0.976x
   w64:  0.944x  0.901x  0.978x  1.026x  0.832x  0.836x  0.956x  0.966x

             X4      X1    A720     A78     A76     A72    A520     A55
mct regular:
    w4:  1.003x  0.990x  0.996x  0.996x  0.998x  1.002x  1.125x  1.162x
    w8:  0.952x  0.896x  0.996x  0.990x  0.940x  0.903x  0.942x  0.972x
   w16:  0.987x  0.878x  0.997x  0.951x  0.920x  0.937x  0.923x  0.975x
   w32:  0.997x  0.893x  0.999x  0.926x  0.884x  0.920x  0.916x  0.958x
mct sharp:
    w4:  0.998x  0.995x  0.994x  0.990x  1.001x  1.000x  1.126x  1.160x
    w8:  1.002x  1.002x  1.000x  1.001x  0.994x  0.999x  1.001x  1.006x
   w16:  1.001x  1.000x  1.000x  1.000x  1.001x  1.000x  1.000x  1.000x
   w32:  0.999x  1.000x  1.000x  1.000x  1.000x  0.972x  1.000x  0.999x