Don't trash the return stack buffer in the NEON loop filter
The NEON loop filter's innermost asm function can return to an address other than the one that called it. This desynchronizes the return stack predictor, causing returns to be mispredicted.
Rework the function so that it always returns to the address that called it, and have it return the information the caller needs to short-circuit storing pixels.
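
For illustration only, here is the shape of the new control flow as a C sketch; the actual change is in NEON assembly. All names (filter_core, filter_edge), the 16-pixel width, and the filter arithmetic are hypothetical stand-ins and are not taken from the codebase. The point is that the core always returns to its caller with a flag, and the caller decides whether to skip the stores.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical core routine: writes filtered pixels to out_p0/out_q0.
 * It always returns normally to its caller. The return value is nonzero
 * if at least one pixel was filtered, and 0 if the caller may skip the
 * stores entirely. The smoothing arithmetic is a placeholder. */
static int filter_core(const uint8_t *p0, const uint8_t *q0,
                       uint8_t *out_p0, uint8_t *out_q0,
                       int n, int limit)
{
    int any = 0;
    for (int i = 0; i < n; i++) {
        int p = p0[i], q = q0[i];
        if (abs(p - q) < limit) {
            int avg = (p + q + 1) >> 1;
            out_p0[i] = (uint8_t)((p + avg + 1) >> 1);
            out_q0[i] = (uint8_t)((q + avg + 1) >> 1);
            any = 1;
        } else {
            out_p0[i] = (uint8_t)p;
            out_q0[i] = (uint8_t)q;
        }
    }
    return any;
}

/* Hypothetical caller: handles up to 16 pixels along one edge.
 * Returns 1 if pixels were stored, 0 if the stores were skipped. */
static int filter_edge(uint8_t *p0, uint8_t *q0, int n, int limit)
{
    uint8_t tp[16], tq[16];
    if (n > 16)
        n = 16;
    if (!filter_core(p0, q0, tp, tq, n, limit))
        return 0;               /* nothing filtered: short-circuit the stores */
    memcpy(p0, tp, (size_t)n);
    memcpy(q0, tq, (size_t)n);
    return 1;
}
```

Because every call is matched by a return to the instruction after it, the return stack predictor stays in sync; the early-out becomes an ordinary conditional branch in the caller.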