Split MC blend
The mstride == 0, mstride == 1, and mstride == w cases are very different from each other, and splitting them into separate functions makes it easier to optimize them.
Also add some further optimizations to the AVX2 asm that became possible after this change.
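
For illustration, a minimal C sketch of how the three mask-stride cases could look as separate functions. The function names, the 6-bit blend formula, and the reading of each stride (mstride == w as a full 2D mask, mstride == 0 as one weight per column, mstride == 1 as one weight per row) are assumptions made for this sketch, not taken from the actual change.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed 6-bit blend with rounding; m is a weight in [0, 64]. */
static inline uint8_t blend_px(int a, int b, int m) {
    return (uint8_t)((a * (64 - m) + b * m + 32) >> 6);
}

/* mstride == w: a full 2D mask, one weight per pixel (hypothetical name). */
static void blend_2d(uint8_t *dst, ptrdiff_t dst_stride, const uint8_t *tmp,
                     int w, int h, const uint8_t *mask)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = blend_px(dst[x], tmp[x], mask[x]);
        dst  += dst_stride;
        tmp  += w;
        mask += w;
    }
}

/* mstride == 0: the same mask row is reused for every row,
 * so there is one weight per column (hypothetical name). */
static void blend_cols(uint8_t *dst, ptrdiff_t dst_stride, const uint8_t *tmp,
                       int w, int h, const uint8_t *mask)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = blend_px(dst[x], tmp[x], mask[x]);
        dst += dst_stride;
        tmp += w;
    }
}

/* mstride == 1: the mask advances by one entry per row,
 * so there is one weight per row (hypothetical name). */
static void blend_rows(uint8_t *dst, ptrdiff_t dst_stride, const uint8_t *tmp,
                       int w, int h, const uint8_t *mask)
{
    for (int y = 0; y < h; y++) {
        const int m = mask[y]; /* loop-invariant across the row */
        for (int x = 0; x < w; x++)
            dst[x] = blend_px(dst[x], tmp[x], m);
        dst += dst_stride;
        tmp += w;
    }
}
```

With the cases separated, each inner loop has a fixed mask access pattern and no per-pixel stride check, which is the kind of specialization that is easier to vectorize.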