arm64: warped motion: Various optimizations
- Reorder loads of filters to benefit in-order cores.
- Use full 128-bit vectors to transpose 8x8 bytes. This calls zip1 in the first stage, which will hurt performance on some older big cores.
- Rework horz stage for 8-bit mode:
    * Use smull instead of mul
    * Replace existing narrow and long instructions
    * Replace mov after calling with right shift
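
A scalar C model of why smull can stand in for mul in the 8-bit horizontal stage. This is only a sketch: it assumes the previous code sign-extended both 8-bit operands to 16 bits before a same-width mul, which the message above does not state. The function names are hypothetical and do not appear in the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed old shape: widen both operands to 16 bits (sxtl),
 * then multiply 16-bit lanes (mul). */
static int16_t lane_sxtl_mul(int8_t a, int8_t b)
{
    int16_t wa = a;             /* sxtl */
    int16_t wb = b;             /* sxtl */
    return (int16_t)(wa * wb);  /* mul on 16-bit lanes */
}

/* New shape: one widening multiply, 8-bit inputs to a 16-bit product
 * (smull with .8b sources and a .8h destination). */
static int16_t lane_smull(int8_t a, int8_t b)
{
    return (int16_t)((int)a * (int)b);
}

/* Check every int8 pair: both forms agree and the product
 * always fits in a signed 16-bit lane. */
static void check_all_pairs(void)
{
    for (int a = -128; a <= 127; a++) {
        for (int b = -128; b <= 127; b++) {
            int exact = a * b;
            assert(exact >= INT16_MIN && exact <= INT16_MAX);
            assert(lane_sxtl_mul((int8_t)a, (int8_t)b) == exact);
            assert(lane_smull((int8_t)a, (int8_t)b) == exact);
        }
    }
}
```

Because an 8-bit by 8-bit signed product never exceeds 16 bits, the widening multiply yields the same lanes without separate extend instructions.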

Cortex-A55
Before:
warp_8x8_8bpc_neon:   1683.2
warp_8x8_16bpc_neon:  1870.7
warp_8x8t_8bpc_neon:  1673.2
warp_8x8t_16bpc_neon: 1848.0
After:
warp_8x8_8bpc_neon:   1267.2
warp_8x8_16bpc_neon:  1769.8
warp_8x8t_8bpc_neon:  1245.4
warp_8x8t_16bpc_neon: 1747.3