Wiener optimizations
Improves overall decoding performance on AVX2-capable systems by around 1-3%, depending on content.
wiener_7tap_8bpc_c: 203223.0
wiener_7tap_8bpc_sse2: 33425.1 (previously: 45781.5)
wiener_7tap_8bpc_ssse3: 21980.3 (previously: 30153.3)
wiener_7tap_8bpc_avx2: 12097.5 (previously: 17262.9)
wiener_5tap_8bpc_sse2: 26902.8
wiener_5tap_8bpc_ssse3: 19829.6
wiener_5tap_8bpc_avx2: 10592.6
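
For reference, the relative gain of the 7-tap functions can be worked out from the figures above. This is only a small sketch of the arithmetic: it assumes the checkasm values are timings where lower is better, and the dictionary and variable names are made up for illustration.

```python
# 7-tap 8bpc checkasm figures copied from the listing above: (previous, current).
# Assumption: these are timings, so a lower value means faster code.
wiener_7tap_8bpc = {
    "sse2": (45781.5, 33425.1),
    "ssse3": (30153.3, 21980.3),
    "avx2": (17262.9, 12097.5),
}

# Speedup of the new code relative to the previous code for each instruction set.
speedups = {isa: prev / cur for isa, (prev, cur) in wiener_7tap_8bpc.items()}

for isa, ratio in speedups.items():
    print(f"wiener_7tap_8bpc_{isa}: {ratio:.2f}x faster than before")
```

The 5-tap functions have no previous figures listed, so no such ratio can be derived for them from this message.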
Reduced cache thrashing benefits the surrounding code as well, so the checkasm numbers don't paint the whole picture.
Edited by Henrik Gramner