AArch64: Trim Armv8.0 Neon path of 6-tap and 8-tap MC functions
Some instruction sequences can be merged now that the lane load/store patch (!1722) has landed.
This change simplifies the loading of filter weights, saving 288 bytes in the Armv8.0 Neon path of the 6-tap and 8-tap MC functions.