AArch64: Add USMMLA impl. for SBD 6-tap H/HV filters
Add a 6-tap variant of the standard bit-depth horizontal subpel filters
using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch
also extends the HV filter with a 6-tap horizontal pass using USMMLA.
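For readers unfamiliar with USMMLA, the following is a minimal scalar sketch in C of what one instruction computes (a 2x8 unsigned byte matrix times the transpose of a 2x8 signed byte matrix, accumulated into a 2x2 matrix of 32-bit integers), and of one way a 6-tap kernel can be laid out so that a single instruction yields four adjacent output pixels. The names usmmla_model and filter6_x4 are hypothetical, and the row layout is an illustrative assumption, not necessarily the exact arrangement used by the assembly in this patch.

```c
#include <stdint.h>

/* Scalar model of one USMMLA: acc (2x2, int32) += n (2x8, uint8) * m^T,
 * where m holds two rows of eight signed bytes. */
static void usmmla_model(int32_t acc[4], const uint8_t n[16], const int8_t m[16])
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 8; k++)
                acc[2 * i + j] += (int32_t)n[8 * i + k] * m[8 * j + k];
}

/* Hypothetical helper: filter four adjacent pixels with a 6-tap kernel
 * using a single model call. Reads src[0..9]. */
static void filter6_x4(int32_t out[4], const uint8_t *src, const int8_t f[6])
{
    uint8_t n[16];
    int8_t m[16] = { 0 };

    for (int k = 0; k < 8; k++) {
        n[k]     = src[k];      /* sample row 0: src[0..7] */
        n[8 + k] = src[2 + k];  /* sample row 1: src[2..9] */
    }
    for (int k = 0; k < 6; k++) {
        m[k]     = f[k];        /* filter row 0: f0..f5, 0, 0 */
        m[9 + k] = f[k];        /* filter row 1: 0, f0..f5, 0 */
    }

    out[0] = out[1] = out[2] = out[3] = 0;
    usmmla_model(out, n, m);    /* out[x] = sum of src[x + k] * f[k], x = 0..3 */
}
```

Because a 6-tap kernel shifted by one position still fits into eight bytes, both filter rows are usable and all four accumulator lanes hold valid output pixels in this layout.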
This MR also contains a typo fix in the SBD 6-tap 2D/HV subpel filter.
Benchmarks show an FPS increase of up to 6-7%, depending on the input video and the CPU used.
This patch increases the size of the .text section by around 1.2 KiB.
Relative runtime of micro benchmarks after this patch on Neoverse and Cortex CPU cores:
regular      V2      V1      X3    A720    A715    A520    A510
w8 hv:   0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
w16 hv:  0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
w32 hv:  0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
w64 hv:  0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x
w8 h:    0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
w16 h:   0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
w32 h:   0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
w64 h:   0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x
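
As a reading aid for the table above: a relative runtime is the new runtime divided by the old one, so values below 1.0 mean the patched code is faster. The small C sketch below converts such a value into a speedup factor and a percentage of time saved. Both function names are hypothetical and are not part of the patch or of the benchmark tooling.

```c
/* Speedup factor for a relative runtime (new time / old time). */
static double speedup_from_relative_runtime(double rel)
{
    return 1.0 / rel;
}

/* Percentage of the original runtime that is saved. */
static double time_saved_percent(double rel)
{
    return (1.0 - rel) * 100.0;
}
```

For example, the 0.746x entry (w8 h on V2) corresponds to a speedup of roughly 1.34x, or about 25.4% less time spent in that function.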
Some benchmark results for the USMMLA version against the USDOT version:
AWS Graviton 3: 193.69 fps -> 200.04 fps ( +3.28% )
AWS Graviton 4: 246.09 fps -> 255.00 fps ( +3.62% )
AWS Graviton 3: 176.33 fps -> 180.61 fps ( +2.43% )
AWS Graviton 4: 225.18 fps -> 231.24 fps ( +2.69% )
AWS Graviton 3: 144.80 fps -> 147.89 fps ( +2.13% )
AWS Graviton 4: 183.44 fps -> 187.38 fps ( +2.15% )
AWS Graviton 3: 140.22 fps -> 142.42 fps ( +1.57% )
AWS Graviton 4: 178.17 fps -> 181.14 fps ( +1.67% )
AWS Graviton 3: 200.66 fps -> 205.25 fps ( +2.29% )
AWS Graviton 4: 260.92 fps -> 266.68 fps ( +2.21% )
AWS Graviton 3 - 720p: 540.45 fps -> 580.43 fps ( +7.40% )
AWS Graviton 4 - 720p: 680.19 fps -> 724.72 fps ( +6.55% )
AWS Graviton 3 - 1080p: 242.12 fps -> 256.48 fps ( +5.93% )
AWS Graviton 4 - 1080p: 305.77 fps -> 326.12 fps ( +6.66% )
AWS Graviton 3 - 2160p: 60.60 fps -> 63.37 fps ( +4.57% )
AWS Graviton 4 - 2160p: 76.71 fps -> 80.69 fps ( +5.19% )
Bosphorus videos were encoded by aomenc (3.7.1+):
aomenc --good --cpu-used=5 -w 1280 -h 720 --bit-depth=8 --ivf -o Bosphorus_720p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m
aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m
aomenc --good --cpu-used=5 -w 3840 -h 2160 --bit-depth=8 --ivf -o Bosphorus_2160p_8bit.ivf Bosphorus_3840x2160_120fps_420_8bit_YUV.y4m
Edited by Arpad Panyik