arm64: filmgrain16: Use sqrdmulh for the scaling*grain multiplication

Before:                  Cortex A53     A72     A73  Apple M1
fgy_32x32xn_16bpc_neon:     10396.8  8150.8  8718.3      19.5
After:
fgy_32x32xn_16bpc_neon:      9665.1  7558.8  7652.8      19.5
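
For illustration only, a scalar C sketch of why a single sqrdmulh can stand in
for the rounded scaling*grain product. It is not the assembly from this
commit. It assumes that scaling is an 8-bit value, that grain fits in 12
signed bits, that scaling_shift is in the range 8..11, and that scaling is
pre-shifted left by (15 - scaling_shift) before the multiply. All function
names are hypothetical.

```c
#include <stdint.h>

/* Floor-dividing right shift, so the model does not depend on how the
 * compiler implements >> for negative values. */
static int64_t asr64(int64_t v, int n) {
    return v >= 0 ? v >> n : -((-v + ((int64_t)1 << n) - 1) >> n);
}

/* Scalar model of one 16-bit lane of SQRDMULH:
 * sat16((2*a*b + (1 << 15)) >> 16). */
static int16_t sqrdmulh_s16(int16_t a, int16_t b) {
    int64_t r = asr64(2 * (int64_t)a * b + 0x8000, 16);
    if (r > INT16_MAX) r = INT16_MAX;
    if (r < INT16_MIN) r = INT16_MIN;
    return (int16_t)r;
}

/* Reference: widen, multiply, add the rounding bias, shift back down. */
static int round2_mul(int scaling, int grain, int shift) {
    return (int)asr64((int64_t)scaling * grain + (1 << (shift - 1)), shift);
}

/* Same result from one doubling multiply: pre-shift scaling so that the
 * implicit rounding shift by 15 nets out to a rounding shift by `shift`.
 * Assumed ranges keep scaling << (15 - shift) below 32768. */
static int sqrdmulh_mul(int scaling, int grain, int shift) {
    return sqrdmulh_s16((int16_t)(scaling << (15 - shift)), (int16_t)grain);
}

/* Exhaustively compares both forms over the assumed input ranges.
 * Returns 1 if they agree everywhere, 0 otherwise. */
static int check_equivalence(void) {
    for (int shift = 8; shift <= 11; shift++)
        for (int scaling = 0; scaling <= 255; scaling++)
            for (int grain = -2048; grain <= 2047; grain++)
                if (round2_mul(scaling, grain, shift) !=
                    sqrdmulh_mul(scaling, grain, shift))
                    return 0;
    return 1;
}
```

Under these assumptions the pre-shifted scaling is at most 255 << 7 = 32640,
so it stays within a signed 16-bit lane and the saturation in sqrdmulh is
never reached. The multiply can then stay in 16-bit lanes instead of
widening to 32 bits and narrowing again.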