x86: Improve AVX2 generate_grain asm
                                        HSW               SKL
                                      old      new      old      new
gen_grain_y_ar0_8bpc_avx2:        19298.9  15342.1  18711.9  17842.5
gen_grain_y_ar1_8bpc_avx2:        63983.7  56378.3  59358.9  57213.2
gen_grain_y_ar2_8bpc_avx2:        86822.1  78599.7  92137.2  90092.4
gen_grain_y_ar3_8bpc_avx2:        88543.4  80883.7  97682.3  94461.6
gen_grain_uv_ar0_8bpc_420_avx2:    5742.2   4976.4   6157.3   6049.9
gen_grain_uv_ar1_8bpc_420_avx2:   15999.0  15003.4  15211.3  15038.6
gen_grain_uv_ar2_8bpc_420_avx2:   21082.1  19496.9  22533.4  22401.5
gen_grain_uv_ar3_8bpc_420_avx2:   23159.6  19810.1  25008.9  22810.5
gen_grain_uv_ar0_8bpc_422_avx2:   11475.5   9300.6  11404.1  11278.6
gen_grain_uv_ar1_8bpc_422_avx2:   31161.1  29267.6  29484.3  29482.6
gen_grain_uv_ar2_8bpc_422_avx2:   41179.1  38221.5  44523.1  44358.9
gen_grain_uv_ar3_8bpc_422_avx2:   46002.3  39058.1  49497.1  45007.2
gen_grain_uv_ar0_8bpc_444_avx2:   20684.6  16200.5  21090.6  20429.6
gen_grain_uv_ar1_8bpc_444_avx2:   62772.6  56551.0  58890.7  57936.0
gen_grain_uv_ar2_8bpc_444_avx2:   80320.2  74349.6  87507.6  86792.5
gen_grain_uv_ar3_8bpc_444_avx2:   89649.4  76560.7  97022.6  88563.1
gen_grain_y_ar0_16bpc_avx2:       19713.1  15822.2  18787.0  17809.5
gen_grain_y_ar1_16bpc_avx2:       61425.5  58064.0  58335.0  57371.9
gen_grain_y_ar2_16bpc_avx2:       78416.5  74290.5  87864.8  85194.7
gen_grain_y_ar3_16bpc_avx2:       81198.8  75390.7  91357.7  87434.1
gen_grain_uv_ar0_16bpc_420_avx2:   6176.9   5290.4   6105.0   5997.7
gen_grain_uv_ar1_16bpc_420_avx2:  16173.8  15294.2  15064.0  15205.6
gen_grain_uv_ar2_16bpc_420_avx2:  20498.5  19281.6  22811.8  22490.3
gen_grain_uv_ar3_16bpc_420_avx2:  21811.7  19930.5  23423.7  22575.6
gen_grain_uv_ar0_16bpc_422_avx2:  11959.1   9805.9  11332.1  11142.3
gen_grain_uv_ar1_16bpc_422_avx2:  31821.6  29928.0  29457.0  29673.6
gen_grain_uv_ar2_16bpc_422_avx2:  39711.1  38179.8  44745.5  44483.1
gen_grain_uv_ar3_16bpc_422_avx2:  41998.4  38771.1  46371.9  44546.4
gen_grain_uv_ar0_16bpc_444_avx2:  21107.8  17165.8  21581.3  20343.0
gen_grain_uv_ar1_16bpc_444_avx2:  61445.1  58489.6  58594.0  58647.9
gen_grain_uv_ar2_16bpc_444_avx2:  78127.2  74867.5  87758.6  87106.2
gen_grain_uv_ar3_16bpc_444_avx2:  81489.5  76197.8  91190.6  87434.1

gen_grain_uv_ar1_16bpc is very marginally slower on Skylake after the changes because it now uses scalar loads instead of gathers, but everything else is a win across the board. The change should be even more beneficial on AMD CPUs, which have notoriously poor gather performance.

Partially addresses #377 (closed), since the grain generation functions no longer use gathers. The issue still remains for the main film grain functions, though.
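
To put the numbers above in relative terms, here is a minimal sketch that computes the percentage change for two rows copied from the table. The helper name percent_change and the choice of rows are illustrative only and are not part of the patch. Negative values mean the new code is faster.

```python
# Timings copied from the benchmark table above as (old, new) pairs.
# Units are whatever the benchmark reported; only the ratios matter here.
timings = {
    "gen_grain_y_ar0_8bpc_avx2": {
        "HSW": (19298.9, 15342.1),
        "SKL": (18711.9, 17842.5),
    },
    "gen_grain_uv_ar1_16bpc_420_avx2": {
        "HSW": (16173.8, 15294.2),
        "SKL": (15064.0, 15205.6),
    },
}


def percent_change(old, new):
    """Relative change from old to new in percent (hypothetical helper).

    A negative result means the new timing is lower, i.e. faster.
    """
    return (new - old) / old * 100.0


for name, per_cpu in timings.items():
    for cpu, (old, new) in per_cpu.items():
        print(f"{name} {cpu}: {percent_change(old, new):+.1f}%")
```

The second row is one of the gen_grain_uv_ar1_16bpc cases mentioned above: it comes out slightly positive on SKL and negative on HSW.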