x86: Add 8-bit AVX-512 (Ice Lake) asm
Overall performance of SSE4.1 vs AVX2 vs AVX-512 on an 8-core/16-thread Intel Rocket Lake system:
On this system AVX-512 speeds up overall decoding by around 10-20% over AVX2 at low thread counts. At high thread counts the improvement shrinks to around 5%, mainly because DRAM bandwidth becomes more of a bottleneck, with the CPU spending an ever-increasing portion of overall runtime waiting on memory instead of doing any useful work. Dual-channel DDR4 is clearly not cutting it anymore, and faster memory, more memory channels, or more L3 cache would be helpful.
On an AWS m6i.4xlarge (16 vCPU Ice Lake-SP) instance, which has more DRAM bandwidth available, the performance delta between AVX2 and AVX-512 remains more consistent across a wide range of thread counts:
For power consumption, I measured power usage as reported by the CPU SVID during real-time decoding of some 4K samples (1080p barely puts any load on the CPU, so the power usage is hardly above idle, with no difference between instruction sets):
Avg. power usage | SSE4.1 | AVX2 | AVX-512 |
---|---|---|---|
HoliFestival | 49.7 W | 46.1 W | 42.5 W |
SummerNature | 42.8 W | 43.2 W | 43.5 W |
SummerInTomsk | 40.3 W | 39.6 W | 37.4 W |
Overall, wider SIMD generally results in better power efficiency. The outlier is SummerNature, which can likely be explained by the fact that it contains only static shots with little or no movement at a very high bitrate, so CPU time is spent very differently compared to the other clips.
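To make the table easier to read at a glance, the relative change against SSE4.1 can be computed per clip. This is only a small sketch over the numbers above; the `power_w` and `relative_change` names are made up for illustration. Since the decoding is real-time, the same work is done in the same wall-clock time in every column, so a lower average power should translate fairly directly into less energy per clip.

```python
# Average power (W), copied from the table above.
power_w = {
    "HoliFestival":  {"SSE4.1": 49.7, "AVX2": 46.1, "AVX-512": 42.5},
    "SummerNature":  {"SSE4.1": 42.8, "AVX2": 43.2, "AVX-512": 43.5},
    "SummerInTomsk": {"SSE4.1": 40.3, "AVX2": 39.6, "AVX-512": 37.4},
}

def relative_change(clip, isa, baseline="SSE4.1"):
    """Percent change in average power of `isa` relative to `baseline`."""
    base = power_w[clip][baseline]
    return (power_w[clip][isa] - base) / base * 100.0

for clip in power_w:
    row = ", ".join(
        f"{isa}: {relative_change(clip, isa):+.1f}%" for isa in ("AVX2", "AVX-512")
    )
    print(f"{clip}: {row}")
```

Negative values mean lower power than SSE4.1; SummerNature is the only clip where both values come out slightly positive.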
The current generation of Intel µarchitectures has 3 SIMD execution units (p0, p1, p5), two of which (p0, p1) are 256-bit and one (p5) 512-bit. On client CPUs p5 can only execute shuffles and basic arithmetic/logical operations (add/sub/and/or/xor etc.); on server CPUs p5 is also capable of executing more complex arithmetic instructions (almost everything, in fact). p0+p1 can fuse into a single combined 512-bit unit in order to perform 512-bit operations, which allows for either 3x256-bit or 2x512-bit operations per cycle. Pure throughput under ideal circumstances, with an instruction mix where all execution units can be fully utilized, is therefore 33% higher with AVX-512 than with AVX2. New instructions such as VNNI and VBMI improve things further by reducing the number of instructions required to perform certain calculations.
It's somewhat content-dependent, but around half of the overall runtime in the decoder is spent in scalar code, which doesn't benefit from SIMD, and some of the DSP code operates on small blocks that don't benefit much, if at all, from wider SIMD.
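The 33% figure follows directly from the port widths. A minimal sketch of the arithmetic, assuming the ideal case described above where every port issues one vector operation per cycle (the variable names are illustrative only):

```python
# Port widths in bits, as described above; p0 and p1 fuse for 512-bit ops.
P0, P1, P5 = 256, 256, 512

avx2_bits = 3 * 256            # p0, p1 and p5 each issue one 256-bit op
avx512_bits = (P0 + P1) + P5   # fused p0+p1 plus p5: two 512-bit ops
gain = avx512_bits / avx2_bits - 1

print(f"{avx2_bits} vs {avx512_bits} bits/cycle, +{gain:.0%}")
# prints "768 vs 1024 bits/cycle, +33%"
```

This is a peak number only; on client CPUs the restricted capabilities of p5 mean that real instruction mixes will often fall short of it.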
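As a rough sanity check, the scalar share can be plugged into Amdahl's law. The sketch below assumes, purely for illustration, that exactly half of the runtime is SIMD code and that all of it sees the ideal 4/3 throughput gain of 2x512-bit over 3x256-bit; neither holds exactly in practice, and `overall_speedup` is a hypothetical helper, not part of the decoder.

```python
def overall_speedup(simd_fraction, simd_speedup):
    """Amdahl's law: only `simd_fraction` of the runtime gets faster."""
    return 1.0 / ((1.0 - simd_fraction) + simd_fraction / simd_speedup)

# Assumptions: half of the runtime is SIMD code, and all of it gets the
# ideal 2x512-bit vs 3x256-bit throughput gain of 4/3.
estimate = overall_speedup(0.5, 4 / 3)
print(f"estimated overall speedup: {estimate:.3f}x")
# prints "estimated overall speedup: 1.143x"
```

An estimate of roughly 14% sits inside the 10-20% range measured at low thread counts. The model ignores memory stalls, small blocks that gain little from wider SIMD, and the instruction count savings from VNNI and VBMI, so it should be read as a ballpark figure rather than a prediction.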