x86: add AVX512-IceLake implementation of HBD 32x64 DCT^2
inv_txfm_add_32x64_dct_dct_0_10bpc_c:          1783.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_0_10bpc_sse4:        243.3 ( 7.33x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx2:        119.1 (14.97x)
inv_txfm_add_32x64_dct_dct_0_10bpc_avx512icl:   142.6 (12.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_c:         50422.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:       2880.5 (17.50x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:       1423.4 (35.43x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:   741.6 (67.99x)
inv_txfm_add_32x64_dct_dct_2_10bpc_c:         50433.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_2_10bpc_sse4:       4015.1 (12.56x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx2:       1767.7 (28.53x)
inv_txfm_add_32x64_dct_dct_2_10bpc_avx512icl:   960.8 (52.49x)
inv_txfm_add_32x64_dct_dct_3_10bpc_c:         50422.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_3_10bpc_sse4:       4500.5 (11.20x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx2:       2111.7 (23.88x)
inv_txfm_add_32x64_dct_dct_3_10bpc_avx512icl:  1777.1 (28.37x)
inv_txfm_add_32x64_dct_dct_4_10bpc_c:         50444.2 ( 1.00x)
inv_txfm_add_32x64_dct_dct_4_10bpc_sse4:       5592.8 ( 9.02x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx2:       2458.1 (20.52x)
inv_txfm_add_32x64_dct_dct_4_10bpc_avx512icl:  1867.2 (27.02x)
As with 16x64, the dc-only case (dct_dct_0) is slightly slower with AVX-512 than with AVX2 (142.6 vs. 119.1). This appears to be an artifact of my test setup; @gramner could not reproduce it.