arm64: filmgrain: Fix overflows in gen_grain
After multiplying two int8_t values, the largest possible product is -128*-128 = 16384. The sum of two such products, 32768, does not fit in an int16_t (even though the sum of any other pair of int8_t products does).
Previously, the summing used 16-bit intermediates for the sum of two products and only lengthened the result to 32 bits when accumulating three or more products.
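The arithmetic can be checked with a small scalar model in C. This is only a sketch of the overflow, not the NEON code itself; the helper names (wrap_to_int16, sum2_narrow, sum2_wide, check_worst_case) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model a 16-bit two's-complement lane: keep the low 16 bits of v
 * and sign-extend them. Uses unsigned arithmetic to stay well defined. */
static int32_t wrap_to_int16(int32_t v) {
    uint32_t low = ((uint32_t)v + 32768u) & 0xFFFFu;
    return (int32_t)low - 32768;
}

/* Sum of two int8_t products accumulated in a 16-bit intermediate. */
static int32_t sum2_narrow(int8_t a0, int8_t b0, int8_t a1, int8_t b1) {
    return wrap_to_int16(wrap_to_int16(a0 * b0) + a1 * b1);
}

/* Same sum with each product widened to 32 bits before adding. */
static int32_t sum2_wide(int8_t a0, int8_t b0, int8_t a1, int8_t b1) {
    return (int32_t)a0 * b0 + (int32_t)a1 * b1;
}

/* Worst case: both products are -128 * -128 = 16384. */
static void check_worst_case(void) {
    /* 16384 + 16384 = 32768, one more than INT16_MAX. */
    assert(sum2_wide(-128, -128, -128, -128) == 32768);
    /* The 16-bit intermediate wraps around to -32768. */
    assert(sum2_narrow(-128, -128, -128, -128) == -32768);
    /* The next largest pair, 16384 + 16256 = 32640, still fits. */
    assert(sum2_narrow(-128, -128, -128, -127) == 32640);
}
```

Calling check_worst_case() passes: only the case where both products are -128*-128 differs between the narrow and the wide sum.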
Before:                     Cortex A53        A72        A73   Apple M1
gen_grain_y_ar1_8bpc_neon:    112598.5    71309.2    74889.8      372.2
gen_grain_y_ar2_8bpc_neon:    139932.4    91442.3    95788.4      387.3
gen_grain_y_ar3_8bpc_neon:    185607.6   115691.6   131655.8      403.0
After:
gen_grain_y_ar1_8bpc_neon:    112968.8    71897.9    76171.2      371.2
gen_grain_y_ar2_8bpc_neon:    142768.8    94517.9    97934.4      387.5
gen_grain_y_ar3_8bpc_neon:    191625.2   121083.0   135975.3      405.6