x86: Add 6-tap variants of high bit-depth mc AVX-512 (Ice Lake) functions

Checkasm numbers on Zen 4:
           8-tap    6-tap
w2_v:       21.7     19.5
w2_hv:      30.5     28.5
w4_v:       24.0     21.3
w4_hv:      33.6     32.0
w8_h:       33.6     28.7
w8_v:       37.9     35.9
w8_hv:      62.4     47.4
w16_h:      70.1     57.0
w16_v:      59.7     53.5
w16_hv:    143.9    117.3
w32_h:     173.3    137.2
w32_v:     125.0    101.0
w32_hv:    414.7    251.1
w64_h:     606.5    484.2
w64_v:     430.5    347.2
w64_hv:   1387.0    848.9
w128_h:   1709.9   1354.8
w128_v:   1207.2    994.3
w128_hv:  3831.9   2361.9

This also removes the hv_w32 path for 8-tap, which is quite large in
terms of code size. It was only around 2% faster than the hv_w16 path
and is rarely used now that a separate 6-tap implementation exists.
The hv_w32 path in 6-tap is around 5% faster than the corresponding
hv_w16 path, so that one is worth keeping.