Multi-threaded decoding with film grain synthesis is slower with AVX2 than with SSE 4.1

What steps will reproduce the problem?

Build dav1d from Git.
Use dav1d to decode a 10-bit AV1 .ivf file that was encoded with film grain synthesis.
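
The decode step can be scripted so that every CPU mask and thread count is timed the same way. The following is only a minimal sketch, not the exact commands used for this report. It assumes a dav1d binary on PATH, a hypothetical input file name (grain_10b.ivf), and the CLI options --cpumask, --muxer null, and --threads. The --threads name in particular is an assumption: some builds expose --framethreads and --tilethreads instead, so check dav1d --help for your build.

```python
import subprocess
import time


def build_cmd(input_path, cpumask, threads, dav1d="dav1d"):
    """Build a dav1d command line that decodes and discards the output."""
    return [
        dav1d,
        "-i", input_path,
        "-o", "/dev/null",
        "--muxer", "null",          # do not write decoded frames
        "--cpumask", cpumask,       # e.g. "sse41" or "avx2"
        "--threads", str(threads),  # assumed option name, see the note above
    ]


def time_decode(cmd):
    """Run one decode and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(
        cmd,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def run_matrix(input_path, masks=("sse41", "avx2"), thread_counts=(1, 8)):
    """Time every mask/thread combination; returns {(mask, threads): seconds}."""
    results = {}
    for mask in masks:
        for threads in thread_counts:
            results[(mask, threads)] = time_decode(
                build_cmd(input_path, mask, threads)
            )
    return results
```

Calling run_matrix("grain_10b.ivf") would give one timing per (mask, threads) pair. Wall-clock time is used here rather than dav1d's own fps readout so that the sketch does not depend on the format of the tool's output.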

What is the expected output?

Decoding with the AVX2 CPU mask is faster than with the SSE 4.1 mask in both single-threaded and multi-threaded workloads.

What do you see instead?

Single-threaded: AVX2 is faster than SSE 4.1.
Multi-threaded (8 threads): SSE 4.1 is faster per thread and also reaches higher CPU utilization (i.e., better thread scaling).
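
To make "better thread scaling" concrete, the single-threaded and multi-threaded results can be reduced to a speedup and a per-thread efficiency. The helper below is a hypothetical sketch: the function name and the numbers in the usage note are placeholders, not measurements from this report.

```python
def scaling(fps_single, fps_multi, threads):
    """Return (speedup, efficiency) of a multi-threaded run.

    speedup    = fps_multi / fps_single
    efficiency = speedup / threads  (1.0 would be perfect linear scaling)
    """
    if fps_single <= 0 or threads <= 0:
        raise ValueError("fps_single and threads must be positive")
    speedup = fps_multi / fps_single
    return speedup, speedup / threads
```

For example, with placeholder values, scaling(100.0, 400.0, 8) returns (4.0, 0.5): a 4x speedup at 50% efficiency per thread. Computing this pair for each CPU mask shows whether a mask loses because its per-thread throughput is lower or because it scales worse. With 8 threads on 4C/8T, half of the threads are SMT siblings, so an efficiency well below 1.0 is expected for both masks.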

What version / commit were you testing with? (git describe can produce this info if building from source). On what operating system?

dav1d 0.9.2-57-g2337127c

OpenMandriva 4.3 RC2

Hardware: Zen 2 Ryzen 7 3700X, limited to 4C/8T to remove the interconnect communication bottleneck during testing.

Please provide any additional information below.

Decoding without film grain synthesis active gives us the expected performance delta, with AVX2 decoding being about 20% faster than SSE 4.1. SSSE3 is slower than AVX2 and SSE 4.1 in all cases.

Three files are included below:

The 10-bit AV1 grain synthesis .ivf file used to test decoding performance. I have not yet tested 12-bit or 8-bit performance.
The first text file, named "dav1d_simd_behavior_MT", describes multi-threaded decoding performance for the test file with different compilers and build settings, using 8 threads of the processor (4C/8T). I did not expect much difference, since dav1d's SIMD code seems to be complete in this area, but I had to make sure.
The second text file, named "dav1d_simd_behavior_MT", describes single-threaded decoding performance for the test file with different compilers and build settings.

If required, I can collect a performance profile to see which functions take the most time during decoding and might explain the performance difference.

dav1d_test_files.7z

Edit 1:

With the help of some of my friends, I've managed to measure the performance of the different implementations of film grain functions using checkasm for these architectures:

Zen 2:

8bpc film grain synthesis Zen 2: https://pastebin.com/hhxciYKg
16bpc film grain synthesis Zen 2: https://pastebin.com/1nt7xyye
8bpc loop restoration Zen 2: https://pastebin.com/4EM9yxdu
16bpc loop restoration Zen 2: https://pastebin.com/Mt292qpJ

Zen 3:

8bpc film grain synthesis Zen 3: https://pastebin.com/1iLGxckr
16bpc film grain synthesis Zen 3: https://pastebin.com/12XVdAK2
8bpc loop restoration Zen 3: https://pastebin.com/dVvepbG8
16bpc loop restoration Zen 3: https://pastebin.com/8dVgW4Hw

Haswell:

The tester kindly piped the results directly to text files, so they are attached as an archive: Haswell__dav1d_simd_Results.zip

Skylake (to show what a full AVX2 implementation can do):

Skylake__dav1d_simd_Results.zip

Edited Nov 13, 2021 by Zak
