AArch64: Add i8mm support for convolutions
This is a follow-up to !1632.
Add an Armv8.6-A i8mm code path for standard bitdepth convolutions.
Only horizontal-vertical (HV) convolutions have 6-tap specialisations
of their vertical passes. All other convolutions are 4- or 8-tap
filters, which fit well with the 4-element USDOT instruction.
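To illustrate why 4- and 8-tap filters map well onto USDOT, here is a minimal scalar C model of one 32-bit lane of the instruction: four unsigned bytes are multiplied by four signed bytes and the products are accumulated into a signed 32-bit value. The function names (usdot_lane, filter_4tap, filter_8tap) are hypothetical and are not taken from this patch; this is a sketch of the arithmetic only, not the code path itself.

```c
#include <stdint.h>

/* Scalar model of a single 32-bit lane of USDOT (assumed helper name):
 * acc += sum over i of (unsigned pixel[i] * signed coefficient[i]). */
static int32_t usdot_lane(int32_t acc, const uint8_t px[4],
                          const int8_t coef[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)px[i] * (int32_t)coef[i];
    return acc;
}

/* A 4-tap filter needs exactly one USDOT lane per output sample. */
static int32_t filter_4tap(const uint8_t *src, const int8_t taps[4])
{
    return usdot_lane(0, src, taps);
}

/* An 8-tap filter is two chained USDOT lanes per output sample. */
static int32_t filter_8tap(const uint8_t *src, const int8_t taps[8])
{
    int32_t acc = usdot_lane(0, src, taps);
    return usdot_lane(acc, src + 4, taps + 4);
}
```

With this model a 4-tap filter consumes one group of four pixels and an 8-tap filter consumes two, so no lane is wasted. A 6-tap filter would leave two of the eight multiplies unused, which may be why 6-tap specialisations only appear in the vertical pass of the HV case.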
Benchmarks show a 4-9% FPS increase relative to the Armv8.4-A (DotProd)
code path, depending on the input video and the CPU used.
This patch increases the .text section by around 5.7 KiB.
Performance relative to the C reference on some CPUs:
Horizontal-vertical micro benchmarks
A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc
regular_w2_hv_8bpc_neon: 5.64x 7.21x 2.86x
regular_w2_hv_8bpc_dotprod: 6.05x 7.98x 3.00x
regular_w2_hv_8bpc_i8mm: 7.06x 8.69x 3.04x
sharp_w2_hv_8bpc_neon: 5.20x 6.04x 2.66x
sharp_w2_hv_8bpc_dotprod: 4.78x 5.83x 2.63x
sharp_w2_hv_8bpc_i8mm: 5.31x 6.41x 2.71x
regular_w4_hv_8bpc_neon: 7.20x 6.34x 11.20x 9.54x 4.40x 3.91x
regular_w4_hv_8bpc_dotprod: 12.77x 10.98x 18.35x 14.57x 6.21x 5.45x
regular_w4_hv_8bpc_i8mm: 14.50x 12.83x 21.42x 15.85x 6.16x 5.54x
sharp_w4_hv_8bpc_neon: 6.24x 5.40x 9.77x 8.24x 3.96x 3.48x
sharp_w4_hv_8bpc_dotprod: 9.76x 8.77x 14.02x 11.61x 5.20x 4.78x
sharp_w4_hv_8bpc_i8mm: 10.84x 9.70x 16.09x 12.68x 5.42x 4.90x
regular_w8_hv_8bpc_neon: 2.17x 2.27x 2.46x 2.57x 3.17x 3.28x
regular_w8_hv_8bpc_dotprod: 3.04x 3.18x 3.11x 3.42x 3.03x 2.98x
regular_w8_hv_8bpc_i8mm: 3.57x 3.87x 3.40x 3.69x 3.27x 3.26x
sharp_w8_hv_8bpc_neon: 1.72x 1.82x 1.93x 2.05x 2.75x 2.86x
sharp_w8_hv_8bpc_dotprod: 2.49x 2.65x 2.54x 2.81x 2.62x 2.38x
sharp_w8_hv_8bpc_i8mm: 2.80x 3.03x 2.79x 3.07x 2.70x 2.70x
regular_w16_hv_8bpc_neon: 1.90x 2.09x 2.17x 2.18x 2.02x 1.99x
regular_w16_hv_8bpc_dotprod: 2.59x 2.85x 2.64x 2.79x 1.93x 1.83x
regular_w16_hv_8bpc_i8mm: 3.01x 3.33x 2.85x 2.94x 2.05x 1.97x
sharp_w16_hv_8bpc_neon: 1.51x 1.67x 1.72x 1.76x 1.74x 1.73x
sharp_w16_hv_8bpc_dotprod: 2.17x 2.41x 2.22x 2.35x 1.70x 1.46x
sharp_w16_hv_8bpc_i8mm: 2.42x 2.69x 2.42x 2.54x 1.72x 1.65x
regular_w32_hv_8bpc_neon: 1.80x 2.01x 1.96x 2.04x 1.81x 1.81x
regular_w32_hv_8bpc_dotprod: 2.43x 2.68x 2.36x 2.55x 1.74x 1.67x
regular_w32_hv_8bpc_i8mm: 2.83x 3.17x 2.51x 2.67x 1.83x 1.78x
sharp_w32_hv_8bpc_neon: 1.42x 1.59x 1.54x 1.64x 1.56x 1.57x
sharp_w32_hv_8bpc_dotprod: 2.07x 2.30x 2.00x 2.17x 1.55x 1.34x
sharp_w32_hv_8bpc_i8mm: 2.29x 2.55x 2.16x 2.33x 1.55x 1.49x
regular_w64_hv_8bpc_neon: 1.82x 1.94x 1.89x 1.95x 1.70x 1.80x
regular_w64_hv_8bpc_dotprod: 2.43x 2.59x 2.25x 2.43x 1.65x 1.66x
regular_w64_hv_8bpc_i8mm: 2.84x 3.04x 2.39x 2.52x 1.73x 1.76x
sharp_w64_hv_8bpc_neon: 1.43x 1.53x 1.47x 1.57x 1.49x 1.49x
sharp_w64_hv_8bpc_dotprod: 2.08x 2.24x 1.91x 2.07x 1.49x 1.28x
sharp_w64_hv_8bpc_i8mm: 2.30x 2.46x 2.07x 2.22x 1.48x 1.42x
regular_w128_hv_8bpc_neon: 1.77x 1.94x 1.84x 1.92x 1.75x 1.69x
regular_w128_hv_8bpc_dotprod: 2.37x 2.57x 2.18x 2.37x 1.70x 1.56x
regular_w128_hv_8bpc_i8mm: 2.76x 3.02x 2.33x 2.45x 1.78x 1.65x
sharp_w128_hv_8bpc_neon: 1.40x 1.53x 1.45x 1.54x 1.42x 1.44x
sharp_w128_hv_8bpc_dotprod: 2.04x 2.23x 1.87x 2.03x 1.43x 1.24x
sharp_w128_hv_8bpc_i8mm: 2.24x 2.45x 2.02x 2.17x 1.42x 1.38x
Horizontal micro benchmarks
A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc
regular_w2_h_8bpc_neon: 2.42x
regular_w2_h_8bpc_dotprod: 3.75x
regular_w2_h_8bpc_i8mm: 4.22x
sharp_w2_h_8bpc_neon: 2.42x
sharp_w2_h_8bpc_dotprod: 3.76x
sharp_w2_h_8bpc_i8mm: 4.23x
regular_w4_h_8bpc_neon: 4.81x 4.11x
regular_w4_h_8bpc_dotprod: 9.14x 7.22x
regular_w4_h_8bpc_i8mm: 11.18x 8.12x
sharp_w4_h_8bpc_neon: 4.78x 4.10x
sharp_w4_h_8bpc_dotprod: 9.14x 7.17x
sharp_w4_h_8bpc_i8mm: 11.11x 8.10x
regular_w8_h_8bpc_neon: 3.16x 3.20x 3.51x 3.32x 3.43x 3.37x
regular_w8_h_8bpc_dotprod: 4.97x 5.12x 7.43x 7.27x 4.95x 5.06x
regular_w8_h_8bpc_i8mm: 7.28x 5.87x 10.38x 8.59x 5.69x 5.69x
sharp_w8_h_8bpc_neon: 2.71x 2.64x 2.77x 2.75x 3.10x 3.09x
sharp_w8_h_8bpc_dotprod: 4.92x 5.09x 7.14x 7.03x 4.94x 5.09x
sharp_w8_h_8bpc_i8mm: 7.21x 5.82x 10.11x 8.45x 5.70x 5.68x
regular_w16_h_8bpc_neon: 2.79x 2.61x 2.76x 2.75x 3.53x 3.22x
regular_w16_h_8bpc_dotprod: 3.81x 4.09x 4.77x 4.90x 3.13x 3.10x
regular_w16_h_8bpc_i8mm: 5.21x 4.55x 6.04x 5.66x 3.56x 3.23x
sharp_w16_h_8bpc_neon: 2.31x 2.22x 2.38x 2.36x 3.12x 2.89x
sharp_w16_h_8bpc_dotprod: 3.80x 4.10x 4.74x 4.87x 3.13x 3.09x
sharp_w16_h_8bpc_i8mm: 5.20x 4.55x 5.98x 5.61x 3.56x 3.22x
regular_w32_h_8bpc_neon: 2.58x 2.40x 2.61x 2.54x 3.14x 2.91x
regular_w32_h_8bpc_dotprod: 3.36x 3.54x 3.92x 4.03x 2.57x 2.11x
regular_w32_h_8bpc_i8mm: 4.48x 3.88x 4.81x 4.55x 2.91x 2.70x
sharp_w32_h_8bpc_neon: 2.15x 2.03x 2.19x 2.17x 2.78x 2.62x
sharp_w32_h_8bpc_dotprod: 3.33x 3.52x 3.90x 3.94x 2.57x 2.10x
sharp_w32_h_8bpc_i8mm: 4.45x 3.85x 4.79x 4.45x 2.89x 2.70x
regular_w64_h_8bpc_neon: 2.49x 2.31x 2.46x 2.41x 2.94x 2.79x
regular_w64_h_8bpc_dotprod: 3.17x 3.33x 3.60x 3.62x 2.41x 2.22x
regular_w64_h_8bpc_i8mm: 4.22x 3.63x 4.40x 4.08x 2.72x 2.53x
sharp_w64_h_8bpc_neon: 2.07x 1.97x 2.06x 2.05x 2.60x 2.49x
sharp_w64_h_8bpc_dotprod: 3.16x 3.32x 3.58x 3.58x 2.40x 2.21x
sharp_w64_h_8bpc_i8mm: 4.20x 3.63x 4.38x 4.04x 2.71x 2.51x
regular_w128_h_8bpc_neon: 2.45x 2.28x 2.38x 2.33x 2.78x 2.69x
regular_w128_h_8bpc_dotprod: 3.09x 3.25x 3.47x 3.47x 2.24x 2.23x
regular_w128_h_8bpc_i8mm: 4.10x 3.55x 4.25x 3.92x 2.52x 2.31x
sharp_w128_h_8bpc_neon: 2.05x 1.94x 2.01x 2.01x 2.47x 2.39x
sharp_w128_h_8bpc_dotprod: 3.09x 3.25x 3.44x 3.46x 2.24x 2.23x
sharp_w128_h_8bpc_i8mm: 4.10x 3.55x 4.22x 3.89x 2.52x 2.31x
Vertical micro benchmarks
A715-mct A715-mc X3-mct X3-mc A510-mct A510-mc
regular_w2_v_8bpc_neon: 3.68x
regular_w2_v_8bpc_dotprod: 3.29x
regular_w2_v_8bpc_i8mm: 3.49x
sharp_w2_v_8bpc_neon: 3.29x
sharp_w2_v_8bpc_dotprod: 3.27x
sharp_w2_v_8bpc_i8mm: 3.46x
regular_w4_v_8bpc_neon: 7.15x 5.62x
regular_w4_v_8bpc_dotprod: 7.43x 5.85x
regular_w4_v_8bpc_i8mm: 7.89x 6.20x
sharp_w4_v_8bpc_neon: 5.83x 4.71x
sharp_w4_v_8bpc_dotprod: 7.36x 5.85x
sharp_w4_v_8bpc_i8mm: 7.90x 6.18x
regular_w8_v_8bpc_neon: 6.11x 6.55x 8.05x 8.24x 4.07x 4.38x
regular_w8_v_8bpc_dotprod: 5.45x 5.61x 8.15x 7.00x 4.01x 4.30x
regular_w8_v_8bpc_i8mm: 7.30x 7.59x 9.46x 9.12x 4.19x 4.49x
sharp_w8_v_8bpc_neon: 4.23x 4.51x 5.46x 5.54x 3.09x 3.33x
sharp_w8_v_8bpc_dotprod: 5.43x 5.58x 7.96x 6.74x 4.01x 4.29x
sharp_w8_v_8bpc_i8mm: 7.26x 7.44x 9.12x 9.02x 4.19x 4.47x
regular_w16_v_8bpc_neon: 3.44x 3.61x 4.33x 4.52x 2.40x 2.36x
regular_w16_v_8bpc_dotprod: 3.20x 3.34x 4.53x 4.53x 2.85x 2.60x
regular_w16_v_8bpc_i8mm: 4.09x 4.33x 5.27x 5.53x 2.87x 2.62x
sharp_w16_v_8bpc_neon: 2.50x 2.61x 3.14x 3.31x 1.82x 1.81x
sharp_w16_v_8bpc_dotprod: 3.20x 3.34x 4.52x 4.51x 2.86x 2.62x
sharp_w16_v_8bpc_i8mm: 4.09x 4.32x 5.15x 5.55x 2.86x 2.65x
regular_w32_v_8bpc_neon: 2.94x 3.12x 3.52x 3.70x 1.81x 1.84x
regular_w32_v_8bpc_dotprod: 2.80x 2.95x 3.74x 3.75x 2.17x 2.06x
regular_w32_v_8bpc_i8mm: 3.54x 3.76x 4.19x 4.48x 2.16x 2.06x
sharp_w32_v_8bpc_neon: 2.14x 2.27x 2.58x 2.73x 1.37x 1.40x
sharp_w32_v_8bpc_dotprod: 2.78x 2.93x 3.70x 3.71x 2.17x 2.05x
sharp_w32_v_8bpc_i8mm: 3.50x 3.73x 4.15x 4.46x 2.18x 2.06x
regular_w64_v_8bpc_neon: 2.74x 2.88x 3.11x 3.33x 1.53x 1.65x
regular_w64_v_8bpc_dotprod: 2.63x 2.75x 3.30x 3.35x 1.84x 1.82x
regular_w64_v_8bpc_i8mm: 3.31x 3.48x 3.73x 3.99x 1.84x 1.82x
sharp_w64_v_8bpc_neon: 2.01x 2.12x 2.29x 2.45x 1.16x 1.25x
sharp_w64_v_8bpc_dotprod: 2.61x 2.75x 3.27x 3.32x 1.83x 1.82x
sharp_w64_v_8bpc_i8mm: 3.29x 3.48x 3.68x 3.94x 1.84x 1.82x
regular_w128_v_8bpc_neon: 2.66x 2.80x 2.92x 3.16x 1.39x 1.53x
regular_w128_v_8bpc_dotprod: 2.56x 2.68x 3.11x 3.18x 1.63x 1.69x
regular_w128_v_8bpc_i8mm: 3.21x 3.39x 3.48x 3.78x 1.63x 1.69x
sharp_w128_v_8bpc_neon: 1.95x 2.06x 2.16x 2.34x 1.06x 1.17x
sharp_w128_v_8bpc_dotprod: 2.55x 2.68x 3.10x 3.17x 1.63x 1.69x
sharp_w128_v_8bpc_i8mm: 3.19x 3.37x 3.49x 3.76x 1.63x 1.69x
Some benchmark results against the Armv8.4-A (DotProd) version:
- AWS Graviton 3: 178.16 fps -> 183.38 fps ( +2.93 % )
- AWS Graviton 3: 162.45 fps -> 166.60 fps ( +2.55 % )
- AWS Graviton 3: 133.95 fps -> 136.51 fps ( +1.91 % )
- AWS Graviton 3: 130.15 fps -> 132.68 fps ( +1.94 % )
- AWS Graviton 3: 192.59 fps -> 197.09 fps ( +2.34 % )
- AWS Graviton 3: 213.57 fps -> 226.32 fps ( +5.97 % )
The Bosphorus 1080p clip was encoded with aomenc (3.7.1+):
aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m
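The percentages in the Graviton 3 list above are the relative FPS change from the DotProd build to the i8mm build. A small sketch of that arithmetic, using a hypothetical helper name:

```c
/* Relative FPS gain in percent when going from `before` to `after`.
 * fps_gain_percent is an illustrative name, not part of the patch. */
static double fps_gain_percent(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

For example, fps_gain_percent(178.16, 183.38) gives about 2.93, matching the first entry in the list.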