mc: Reduce stack use in {put,prep}_scaled_{bilin,8tap}
For the bilin cases, this seems to make the C versions somewhat faster (measured on x86_64; 7-25% faster with compiler autovectorization). For 8tap, it makes hardly any difference.
Before:                                       GCC     Clang
mc_scaled_8tap_regular_w128_8bpc_c:      115155.5   98549.3
mc_scaled_8tap_regular_w128_8bpc_ssse3:   17936.0   18411.1
mc_scaled_bilinear_w128_8bpc_c:           40290.0   51812.9
mc_scaled_bilinear_w128_8bpc_ssse3:       18243.9   18177.0
After:                                        GCC     Clang
mc_scaled_8tap_regular_w128_8bpc_c:      116304.3   99453.2
mc_scaled_8tap_regular_w128_8bpc_ssse3:   18387.0   18077.3
mc_scaled_bilinear_w128_8bpc_c:           37381.4   41145.0
mc_scaled_bilinear_w128_8bpc_ssse3:       18423.8   18031.6
(Benchmarked with seed 0; the total runtime for the scaled benchmarks is significantly affected by the random seed.)
This reduces the stack usage of these functions from around 65 KB each to less than 1 KB for bilin and around 2 KB for 8tap.
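As a rough sanity check on those sizes, here is a minimal sketch of the arithmetic. It assumes that the dominant stack consumer is a buffer of 16-bit intermediate pixels, 128 columns wide, which either covers the whole block (up to 256 rows plus filter overlap) or only the rows the vertical filter is currently reading. The constants and helper names below are hypothetical and chosen for illustration; they are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative assumptions, not taken from the patch itself. */
enum {
    MAX_W      = 128, /* widest block, as in the w128 benchmarks */
    MAX_ROWS   = 256, /* assumed worst-case source rows for a scaled block */
    TAPS_8TAP  = 8,   /* rows an 8-tap vertical filter reads at once */
    TAPS_BILIN = 2    /* rows a bilinear vertical filter reads at once */
};

static_assert(sizeof(int16_t) == 2, "intermediates are assumed to be 16-bit");

/* Bytes needed to hold `rows` rows of MAX_W 16-bit intermediates. */
static size_t mid_bytes(size_t rows)
{
    return rows * MAX_W * sizeof(int16_t);
}

/* Whole intermediate image at once: every source row plus filter overlap. */
static size_t full_buffer_bytes(size_t taps)
{
    return mid_bytes(MAX_ROWS + taps - 1);
}

/* Only the rows the vertical filter is currently reading. */
static size_t row_window_bytes(size_t taps)
{
    return mid_bytes(taps);
}
```

Under these assumptions, full_buffer_bytes(TAPS_8TAP) comes to 67328 bytes (about 65.75 KB), while row_window_bytes() gives 2048 bytes for 8tap and 512 bytes for bilin, in line with the figures quoted above. The real functions have other locals as well, so this only accounts for the largest buffer.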
With this in place, the required stack space for dav1d should be mostly identical across configurations; on x86_64 (both with and without assembly), it can run with 62 KB of stack, and on arm and aarch64, it can run with 58 KB of stack.