mc: Reduce stack use in {put,prep}_scaled_{bilin,8tap}

Martin Storsjö requested to merge mstorsjo/dav1d:mc-scaled-rewrite into master

For the bilin cases, this seems to make the C versions somewhat faster (measured on x86_64; 7-25% faster with compiler autovectorization). For 8tap, it makes hardly any difference.

Before:                                      GCC     Clang
mc_scaled_8tap_regular_w128_8bpc_c:     115155.5   98549.3
mc_scaled_8tap_regular_w128_8bpc_ssse3:  17936.0   18411.1
mc_scaled_bilinear_w128_8bpc_c:          40290.0   51812.9
mc_scaled_bilinear_w128_8bpc_ssse3:      18243.9   18177.0
After:
mc_scaled_8tap_regular_w128_8bpc_c:     116304.3   99453.2
mc_scaled_8tap_regular_w128_8bpc_ssse3:  18387.0   18077.3
mc_scaled_bilinear_w128_8bpc_c:          37381.4   41145.0
mc_scaled_bilinear_w128_8bpc_ssse3:      18423.8   18031.6

(Benchmarked with seed 0; the total runtime of the scaled benchmarks is significantly affected by the random seed.)

This reduces the stack usage of these functions from around 65 KB each to less than 1 KB for bilin and around 2 KB for 8tap.
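To illustrate where a saving of this size can come from, here is a minimal sketch of one possible approach, not necessarily the code in this branch. Instead of horizontally filtering every needed source row into one large intermediate buffer, it keeps only the rows the vertical filter currently needs. The names `hfilter_row`, `vblend`, `put_scaled_full` and `put_scaled_rolling`, the limits `MAX_W` and `MAX_H`, and the simplified 4-bit weights and rounding are all assumptions made for this example and do not match dav1d's exact arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_W = 128, MAX_H = 128 };  /* hypothetical block size limits */

/* Horizontally interpolate one source row; positions are in 1/1024 pel. */
static void hfilter_row(int16_t *out, const uint8_t *src,
                        int w, int mx, int dx)
{
    int pos = mx;
    for (int x = 0; x < w; x++, pos += dx) {
        const uint8_t *s = src + (pos >> 10);
        const int frac = (pos & 0x3ff) >> 6;   /* 4-bit weight, 0..15 */
        out[x] = (int16_t)(16 * s[0] + frac * (s[1] - s[0]));
    }
}

/* Blend two intermediate samples vertically and round back to 8 bits. */
static uint8_t vblend(int a, int b, int frac)
{
    return (uint8_t)((16 * a + frac * (b - a) + 128) >> 8);
}

/* Variant 1: filter all rows up front; about 64 KB of stack for mid[]. */
static void put_scaled_full(uint8_t *dst, ptrdiff_t dst_stride,
                            const uint8_t *src, ptrdiff_t src_stride,
                            int w, int h, int mx, int my, int dx, int dy)
{
    assert(w <= MAX_W && h <= MAX_H && dy <= 2048);
    int16_t mid[MAX_W * (2 * MAX_H + 1)];
    const int tmp_h = (((h - 1) * dy + my) >> 10) + 2;

    for (int y = 0; y < tmp_h; y++)
        hfilter_row(mid + y * MAX_W, src + y * src_stride, w, mx, dx);

    for (int y = 0, pos = my; y < h; y++, pos += dy) {
        const int16_t *r = mid + (pos >> 10) * MAX_W;
        const int frac = (pos & 0x3ff) >> 6;
        for (int x = 0; x < w; x++)
            dst[y * dst_stride + x] = vblend(r[x], r[x + MAX_W], frac);
    }
}

/* Variant 2: keep only the two rows in use; 512 bytes of stack for rows[]. */
static void put_scaled_rolling(uint8_t *dst, ptrdiff_t dst_stride,
                               const uint8_t *src, ptrdiff_t src_stride,
                               int w, int h, int mx, int my, int dx, int dy)
{
    assert(w <= MAX_W && h <= MAX_H && dy <= 2048);
    int16_t rows[2][MAX_W];
    int held[2] = { -1, -1 };   /* source row currently stored in each slot */

    for (int y = 0, pos = my; y < h; y++, pos += dy) {
        const int sy = pos >> 10;
        /* Filter rows sy and sy + 1 only if they are not already held. */
        for (int r = sy; r <= sy + 1; r++) {
            if (held[r & 1] != r) {
                hfilter_row(rows[r & 1], src + r * src_stride, w, mx, dx);
                held[r & 1] = r;
            }
        }
        const int16_t *a = rows[sy & 1], *b = rows[(sy + 1) & 1];
        const int frac = (pos & 0x3ff) >> 6;
        for (int x = 0; x < w; x++)
            dst[y * dst_stride + x] = vblend(a[x], b[x], frac);
    }
}
```

Both variants read the same source samples and produce identical output; only the lifetime of the intermediate rows differs. In this sketch, two rows of 128 `int16_t` values take 512 bytes, and an 8-row window for an 8-tap vertical filter would take 2 KB, which is in line with the figures quoted above.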

With this in place, the required stack space for dav1d should be mostly identical across configurations; on x86_64 (both with and without assembly), it can run with 62 KB of stack, and on arm and aarch64, it can run with 58 KB of stack.
