arm64: warped motion: Various optimizations (!1146) · Merge requests · VideoLAN / dav1d

Reorder the filter loads to benefit in-order cores.
Use full 128-bit vectors to transpose 8x8 bytes. Note that zip1 is used in the first stage, which hurts performance on some older big cores.
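
For illustration only, here is a portable C model of one way to transpose an 8x8 byte block held in four 128-bit vectors using nothing but zip1/zip2. It is not taken from the patch: the register layout (two rows per vector) and the helper names `zip1_16b`, `zip2_16b` and `transpose_8x8_bytes` are assumptions made for this sketch. Each stage is a perfect shuffle of the 64 bytes, and three such stages swap the row and column index of every byte.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of ZIP1 on .16b: interleave the low 8 bytes of a and b. */
static void zip1_16b(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Scalar model of ZIP2 on .16b: interleave the high 8 bytes of a and b. */
static void zip2_16b(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = a[8 + i];
        out[2 * i + 1] = b[8 + i];
    }
}

/* Hypothetical layout: v[0..3] hold rows 0-1, 2-3, 4-5 and 6-7 of the block.
 * Three identical zip stages (12 zips in total) transpose it in place. */
static void transpose_8x8_bytes(uint8_t v[4][16])
{
    uint8_t t[4][16];
    for (int stage = 0; stage < 3; stage++) {
        zip1_16b(t[0], v[0], v[2]);
        zip2_16b(t[1], v[0], v[2]);
        zip1_16b(t[2], v[1], v[3]);
        zip2_16b(t[3], v[1], v[3]);
        memcpy(v, t, sizeof(t));
    }
}
```

The real assembly may order or pair the registers differently; the sketch only shows that full-width zips are sufficient for the transpose.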
Rework the horizontal stage for 8-bit mode:
- Use smull instead of mul
- Replace existing narrow and long instructions
- Replace the mov after the call with a right shift
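
As a rough sketch of why smull can replace mul here, the following portable C models both forms on eight lanes. It assumes both operands are available as signed bytes; how the patch prepares the pixel data for that is not described above, and the function names `smull_8b` and `sxtl_mul_8b` are made up for this example. Since the product of two signed bytes always fits in 16 bits, the widening multiply yields the same lanes as sign-extending first and then multiplying, without the separate extension instructions.

```c
#include <stdint.h>

/* Scalar model of SMULL Vd.8H, Vn.8B, Vm.8B:
 * signed 8-bit x signed 8-bit -> 16-bit product per lane. */
static void smull_8b(int16_t out[8], const int8_t a[8], const int8_t b[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(a[i] * b[i]);
}

/* Scalar model of the longer sequence: SXTL on both inputs,
 * followed by a 16-bit MUL on the widened lanes. */
static void sxtl_mul_8b(int16_t out[8], const int8_t a[8], const int8_t b[8])
{
    for (int i = 0; i < 8; i++) {
        int16_t wa = a[i];  /* sign-extend first operand  */
        int16_t wb = b[i];  /* sign-extend second operand */
        out[i] = (int16_t)(wa * wb);
    }
}
```

The largest magnitude, (-128) * (-128) = 16384, is still within the int16_t range, so no lane can overflow in either model.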

Cortex-A55

Before:
warp_8x8_8bpc_neon: 1683.2
warp_8x8_16bpc_neon: 1870.7
warp_8x8t_8bpc_neon: 1673.2
warp_8x8t_16bpc_neon: 1848.0

After:
warp_8x8_8bpc_neon: 1267.2
warp_8x8_16bpc_neon: 1769.8
warp_8x8t_8bpc_neon: 1245.4
warp_8x8t_16bpc_neon: 1747.3
