AArch64: Optimize Armv8.0 Neon path of HBD horizontal filters

Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_hbd_h_neon into master

The reduction step of the horizontal HBD MC filters uses an SRSHL+SQXTUN+SRSHL instruction sequence. In the horizontal-only case this sequence can be replaced by a single SQSHRUN instruction combined with an additional rounding value (34 for 10-bit and 40 for 12-bit).
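
Why the two rounding shifts collapse into one can be checked with a small scalar model. This is only a sketch, not the Neon code from this patch. It assumes the two shifts are `6 - intermediate_bits` and `intermediate_bits`, with `intermediate_bits = 14 - bitdepth`, which is consistent with 34 = 32 + 2 and 40 = 32 + 8. It also assumes a final clamp to the pixel range. All function names are hypothetical.

```c
#include <stdint.h>

/* Rounding right shift: scalar stand-in for SRSHL with a negative shift.
 * Relies on >> of a negative int being an arithmetic shift, which holds
 * on common compilers but is implementation-defined in C. */
static int32_t rshift_round(int32_t v, int sh) {
    return (v + (1 << (sh - 1))) >> sh;
}

static int32_t clamp_i32(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Combined rounding value for the single-shift form (assumed derivation). */
static int32_t rounding_constant(int bitdepth) {
    const int first_shift = 6 - (14 - bitdepth);
    return 32 + (1 << (first_shift - 1));
}

/* Model of the old sequence: round-shift, saturate to u16, round-shift. */
static int32_t reduce_two_step(int32_t sum, int bitdepth) {
    const int intermediate_bits = 14 - bitdepth;       /* assumption */
    const int first_shift = 6 - intermediate_bits;
    int32_t t = rshift_round(sum, first_shift);         /* SRSHL  */
    t = clamp_i32(t, 0, 65535);                         /* SQXTUN */
    t = rshift_round(t, intermediate_bits);             /* SRSHL  */
    return clamp_i32(t, 0, (1 << bitdepth) - 1);        /* pixel clamp (assumed) */
}

/* Model of the new sequence: add the rounding value, then one
 * truncating shift by 6 with unsigned saturation (SQSHRUN-like). */
static int32_t reduce_one_step(int32_t sum, int bitdepth) {
    int32_t t = (sum + rounding_constant(bitdepth)) >> 6;
    t = clamp_i32(t, 0, 65535);
    return clamp_i32(t, 0, (1 << bitdepth) - 1);        /* pixel clamp (assumed) */
}

/* Counts the inputs in [lo, hi] on which the two models disagree. */
static int count_mismatches(int bitdepth, int32_t lo, int32_t hi) {
    int mismatches = 0;
    for (int32_t s = lo; s <= hi; s++)
        if (reduce_two_step(s, bitdepth) != reduce_one_step(s, bitdepth))
            mismatches++;
    return mismatches;
}
```

Calling `count_mismatches(10, lo, hi)` or `count_mismatches(12, lo, hi)` over a range of filter sums compares the two models exhaustively. Where the intermediate value does not saturate, they agree because a rounding shift by `a` followed by a rounding shift by `b` equals one truncating shift by `a + b` after adding `2^(a-1) + 2^(a+b-1)`.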

This patch also improves instruction scheduling and removes some EXT instructions from the 6-tap path by using pointer arithmetic instead.

Relative runtime of micro benchmarks after this patch on some Cortex CPU cores (values below 1.000x mean the patched code is faster):

          X4      X1    A720     A78     A76     A72    A520     A55
mc regular:
  w2:  0.848x  0.850x  0.948x  0.873x  0.850x  0.876x  0.856x  0.860x
  w4:  0.898x  0.863x  0.995x  0.978x  0.880x  0.807x  1.035x  1.044x
  w8:  0.855x  0.796x  0.933x  0.855x  0.827x  0.753x  0.891x  0.959x
 w16:  0.860x  0.816x  0.954x  0.926x  0.794x  0.722x  0.880x  0.944x
 w32:  0.853x  0.832x  0.958x  0.929x  0.790x  0.733x  0.877x  0.936x
 w64:  0.854x  0.838x  0.956x  0.935x  0.759x  0.744x  0.875x  0.924x
mc sharp:
  w2:  0.843x  0.854x  0.946x  0.872x  0.836x  0.842x  0.855x  0.862x
  w4:  0.892x  0.861x  0.994x  0.979x  0.875x  0.797x  1.035x  1.044x
  w8:  0.905x  0.881x  1.013x  0.921x  0.881x  0.847x  0.945x  0.976x
 w16:  0.924x  0.892x  1.000x  0.991x  0.860x  0.823x  0.949x  0.975x
 w32:  0.937x  0.896x  0.988x  1.005x  0.832x  0.813x  0.952x  0.976x
 w64:  0.944x  0.901x  0.978x  1.026x  0.832x  0.836x  0.956x  0.966x
          X4      X1    A720     A78     A76     A72    A520     A55
mct regular:
  w4:  1.003x  0.990x  0.996x  0.996x  0.998x  1.002x  1.125x  1.162x
  w8:  0.952x  0.896x  0.996x  0.990x  0.940x  0.903x  0.942x  0.972x
 w16:  0.987x  0.878x  0.997x  0.951x  0.920x  0.937x  0.923x  0.975x
 w32:  0.997x  0.893x  0.999x  0.926x  0.884x  0.920x  0.916x  0.958x
mct sharp:
  w4:  0.998x  0.995x  0.994x  0.990x  1.001x  1.000x  1.126x  1.160x
  w8:  1.002x  1.002x  1.000x  1.001x  0.994x  0.999x  1.001x  1.006x
 w16:  1.001x  1.000x  1.000x  1.000x  1.001x  1.000x  1.000x  1.000x
 w32:  0.999x  1.000x  1.000x  1.000x  1.000x  0.972x  1.000x  0.999x