AVX2: Swap shuffles with Zen 2/3 friendly equivalents
On Zen 2 and Zen 3, vpermq is slower than vperm2i128. Some of our assembly uses vpermq to swap the two 128-bit lanes of a vector where vperm2i128 would do the same job.
On current Intel CPUs, the two instructions are equally expensive, so there should be no impact there.
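
For reference, a minimal sketch of why the two forms are interchangeable for a lane swap, written with C intrinsics rather than the project's assembly. The function names are hypothetical, the immediates (0x4E for vpermq, 0x01 for vperm2i128 with both sources set to the same register) are the usual lane-swap encodings and are assumed to match what the assembly uses, and a GCC/Clang toolchain is assumed for the target attribute and the CPU check. A compiler is free to pick a different instruction than the intrinsic suggests, so this only illustrates that the results are identical, not the timing.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Lane swap via vpermq: imm 0x4E selects qwords [2,3,0,1]. */
__attribute__((target("avx2")))
static void swap_lanes_vpermq(const uint64_t in[4], uint64_t out[4])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    v = _mm256_permute4x64_epi64(v, 0x4E);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Lane swap via vperm2i128: with both sources equal, imm 0x01 puts the
 * high 128-bit lane in the low half and the low lane in the high half. */
__attribute__((target("avx2")))
static void swap_lanes_vperm2i128(const uint64_t in[4], uint64_t out[4])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    v = _mm256_permute2x128_si256(v, v, 0x01);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Returns 1 if both variants produce the swapped vector, 0 if they do not,
 * and -1 if the CPU running this code has no AVX2 support. */
int lane_swap_variants_match(void)
{
    const uint64_t in[4]       = { 0, 1, 2, 3 };
    const uint64_t expected[4] = { 2, 3, 0, 1 };
    uint64_t a[4], b[4];

    if (!__builtin_cpu_supports("avx2"))
        return -1;

    swap_lanes_vpermq(in, a);
    swap_lanes_vperm2i128(in, b);

    return memcmp(a, expected, sizeof(a)) == 0 &&
           memcmp(b, expected, sizeof(b)) == 0;
}
```

The equivalence only holds for a pure swap of the two 128-bit lanes. vpermq patterns that reorder qwords within a lane have no single vperm2i128 equivalent and are not candidates for this change.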