arm64: itx16: Use usqadd to avoid separate clamping of negative values

Before:                                   Cortex A53     A72     A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:          40.7    23.0    24.0
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:         116.0    71.5    78.2
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:          85.7    50.7    53.8
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:         287.0   203.5   215.2
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:       255.7   129.1   140.4
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:      1401.4  1026.7  1039.2
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:      1913.2  1407.3  1479.6

After:
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:          38.7    21.5    22.2
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:         116.0    71.3    77.2
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:          76.7    44.7    43.5
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:         278.0   203.0   203.9
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:       236.9   106.2   116.2
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:      1368.7   999.7  1008.4
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:      1880.5  1381.2  1459.4
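
For reference, the idea behind using usqadd can be sketched as a scalar C model of a single 16-bit lane. This is only an illustration, not the NEON code from this change: the helper names and the PIXEL_MAX_10BPC constant are hypothetical, and the "separate" variant is an assumed reconstruction of a signed add followed by a lower and an upper clamp. Because usqadd adds a signed value to an unsigned accumulator and saturates the result to the unsigned range, negative sums already come out as 0, so only the clamp against the maximum pixel value should remain.

```c
#include <stdint.h>

/* Assumed maximum pixel value for 10 bpc content (hypothetical name). */
#define PIXEL_MAX_10BPC 0x3ff

/* Scalar model of one lane of a signed saturating add (as SQADD does). */
static int16_t sat_add_s16(int16_t a, int16_t b) {
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

/* Scalar model of one lane of USQADD: unsigned accumulator plus signed
 * addend, saturated to the unsigned range [0, 0xffff]. */
static uint16_t sat_add_u16_s16(uint16_t acc, int16_t add) {
    int32_t s = (int32_t)acc + (int32_t)add;
    if (s < 0) s = 0;
    if (s > UINT16_MAX) s = UINT16_MAX;
    return (uint16_t)s;
}

/* Assumed "before" shape: signed add, then clamp below 0, then clamp
 * above the maximum pixel value. */
static uint16_t add_clamp_separate(uint16_t pixel, int16_t residual) {
    int16_t s = sat_add_s16((int16_t)pixel, residual);
    if (s < 0) s = 0;
    if (s > PIXEL_MAX_10BPC) s = PIXEL_MAX_10BPC;
    return (uint16_t)s;
}

/* Assumed "after" shape: the unsigned saturation handles negative sums,
 * so only the upper clamp is left. */
static uint16_t add_clamp_usqadd(uint16_t pixel, int16_t residual) {
    uint16_t s = sat_add_u16_s16(pixel, residual);
    if (s > PIXEL_MAX_10BPC) s = PIXEL_MAX_10BPC;
    return s;
}
```

For pixel values within the 10 bpc range, both variants should return the same result for any 16-bit residual. The second one needs one operation fewer per vector, which is consistent with the modest gains in the numbers above.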