Avoid excessive L2 collisions with certain frame widths
Improves the overall decoding speed of a sample 2048x1024 clip by a few percent on my machine due to fewer cache misses, although this number will probably vary wildly depending on hardware. It should help more with even larger power-of-two widths.
The offset at which addresses map to the same set for a given cache implementation can (generally) be calculated as size / associativity.
For example, a 128 kB cache with 8-way associativity has a "critical stride" of 16 kB. With a stride of 2048 bytes, every 8th row maps to the same set, and each set can only hold as many entries as the associativity. After accessing more than 64 consecutive rows, the earlier rows therefore start being evicted from the cache, which we want to avoid.
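
The arithmetic above can be checked with a small sketch. This is only an illustration and not code from this change: the function names are hypothetical, and it assumes the row stride is a multiple of the cache line size and that each row touches one cache line at the same column.

```python
import math


def critical_stride(cache_size: int, associativity: int) -> int:
    """Byte offset at which addresses map to the same cache set again."""
    return cache_size // associativity


def rows_before_eviction(cache_size: int, associativity: int, row_stride: int) -> int:
    """Consecutive rows that fit before the sets they map to overflow.

    Assumes row_stride is a multiple of the cache line size and that one
    cache line per row is accessed (illustrative simplification).
    """
    stride = critical_stride(cache_size, associativity)
    # Rows this many apart land in the same set.
    rows_per_alias = stride // math.gcd(stride, row_stride)
    # Each of those sets can hold `associativity` lines.
    return rows_per_alias * associativity


if __name__ == "__main__":
    kb = 1024
    print(critical_stride(128 * kb, 8))             # 16384
    print(rows_before_eviction(128 * kb, 8, 2048))  # 64
```

With the numbers from the example (128 kB, 8-way, 2048-byte stride) this gives a critical stride of 16384 bytes and 64 rows before eviction begins.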