x86: Add minor loopfilter asm improvements
The AVX2 changes mainly reduce code size by sharing common code between the luma and chroma functions, but the 8-bit AVX-512 changes also include some small speedups due to more efficient mask calculations:
lpf_h_sb_uv_w4_8bpc_avx512icl: 131.0 -> 129.3
lpf_h_sb_uv_w6_8bpc_avx512icl: 178.9 -> 172.8
lpf_h_sb_y_w4_8bpc_avx512icl: 234.0 -> 228.6
lpf_h_sb_y_w8_8bpc_avx512icl: 384.7 -> 375.8
lpf_h_sb_y_w16_8bpc_avx512icl: 620.8 -> 587.7
lpf_v_sb_uv_w4_8bpc_avx512icl: 32.9 -> 31.1
lpf_v_sb_uv_w6_8bpc_avx512icl: 67.9 -> 64.4
lpf_v_sb_y_w4_8bpc_avx512icl: 64.7 -> 63.2
lpf_v_sb_y_w8_8bpc_avx512icl: 185.2 -> 175.6
lpf_v_sb_y_w16_8bpc_avx512icl: 350.6 -> 314.6