x86: Add high bit-depth loopfilter AVX-512 (Ice Lake) asm

Overall a decent amount faster than AVX2. The vertical filters benefit
more than the horizontal ones, mainly because the transposes in the
latter are a bit of a bottleneck (current Intel CPUs can do two 256-bit
shuffles or one 512-bit shuffle per cycle, so shuffle throughput per
cycle is unchanged by the wider vectors).
                               w4     w8    w16
lpf_v_sb_y_16bpc_avx2:      184.3  370.9  544.6
lpf_v_sb_y_16bpc_avx512icl: 111.7  210.2  336.4
lpf_h_sb_y_16bpc_avx2:      321.8  546.1  844.6
lpf_h_sb_y_16bpc_avx512icl: 253.9  405.9  717.7

                                w4     w6
lpf_v_sb_uv_16bpc_avx2:       95.4  161.2
lpf_v_sb_uv_16bpc_avx512icl:  59.2   90.3
lpf_h_sb_uv_16bpc_avx2:      163.3  236.2
lpf_h_sb_uv_16bpc_avx512icl: 133.1  168.9
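
The vertical-vs-horizontal gap can be checked directly from the numbers
above. A minimal sketch, assuming the figures are timings where lower is
faster; the TIMINGS table and the speedups() helper are illustrative
names, not part of the codebase:

```python
# Timings copied from the benchmark tables in this message, stored as
# (avx2, avx512icl) pairs. Assumption: lower means faster.
TIMINGS = {
    "lpf_v_sb_y": {"w4": (184.3, 111.7), "w8": (370.9, 210.2), "w16": (544.6, 336.4)},
    "lpf_h_sb_y": {"w4": (321.8, 253.9), "w8": (546.1, 405.9), "w16": (844.6, 717.7)},
    "lpf_v_sb_uv": {"w4": (95.4, 59.2), "w6": (161.2, 90.3)},
    "lpf_h_sb_uv": {"w4": (163.3, 133.1), "w6": (236.2, 168.9)},
}


def speedups(timings):
    """Return the AVX2 / AVX-512 timing ratio for every function and width."""
    return {
        name: {width: avx2 / avx512 for width, (avx2, avx512) in widths.items()}
        for name, widths in timings.items()
    }


if __name__ == "__main__":
    for name, widths in speedups(TIMINGS).items():
        ratios = "  ".join(f"{width}: {ratio:.2f}x" for width, ratio in widths.items())
        print(f"{name:<12} {ratios}")
```

A ratio above 1.0 means the AVX-512 version is faster. For every width
listed, the vertical ratio is larger than the horizontal one.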