Provide implementations for sad, sad_xN, ssd functions using dotprod instructions on aarch64 (!171) · Merge requests · VideoLAN / x264

Konstantinos Margaritis requested to merge markos/x264:feature/aarch64-dotprod-optimizations into master Mar 04, 2025

Based on the groundwork of !169 (merged), provide implementations for functions using the SDOT/UDOT instructions of the Armv8 DotProd extension.

Functions implemented: sad_16x8, sad_16x16, sad_x3_16x8, sad_x3_16x16, sad_x4_16x8, sad_x4_16x16, ssd_8x4, ssd_8x8, ssd_8x16, ssd_16x8, ssd_16x16, pixel_vsad
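As a rough illustration of how UDOT can produce these sums, the sketch below models the instruction in portable scalar C rather than reproducing the assembly in this merge request. It assumes the common approach of taking per-byte absolute differences and then dotting them with a vector of ones for SAD, or with themselves for SSD. The names udot_model, absdiff_row, sad_16xh_model and ssd_16xh_model are hypothetical, and the actual instruction sequence in the patch may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Scalar model of one UDOT on a 128-bit register: each of the four 32-bit
 * accumulator lanes gains the dot product of its own group of four
 * unsigned bytes taken from a and b. */
static void udot_model(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16])
{
    for (int lane = 0; lane < 4; lane++)
        for (int j = 0; j < 4; j++)
            acc[lane] += (uint32_t)a[4 * lane + j] * b[4 * lane + j];
}

/* Per-byte absolute difference of one 16-pixel row (the job of UABD). */
static void absdiff_row(uint8_t d[16], const uint8_t *p1, const uint8_t *p2)
{
    for (int x = 0; x < 16; x++)
        d[x] = (uint8_t)abs(p1[x] - p2[x]);
}

/* SAD of a 16xh block: dot the absolute differences with a vector of ones. */
uint32_t sad_16xh_model(const uint8_t *p1, int stride1,
                        const uint8_t *p2, int stride2, int h)
{
    uint8_t ones[16], d[16];
    uint32_t acc[4] = {0};

    assert(h > 0);
    memset(ones, 1, sizeof(ones));
    for (int y = 0; y < h; y++) {
        absdiff_row(d, p1 + y * stride1, p2 + y * stride2);
        udot_model(acc, d, ones);
    }
    /* Final horizontal add across the four lanes. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* SSD of a 16xh block: dot the absolute differences with themselves. */
uint32_t ssd_16xh_model(const uint8_t *p1, int stride1,
                        const uint8_t *p2, int stride2, int h)
{
    uint8_t d[16];
    uint32_t acc[4] = {0};

    assert(h > 0);
    for (int y = 0; y < h; y++) {
        absdiff_row(d, p1 + y * stride1, p2 + y * stride2);
        udot_model(acc, d, d);
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

In this model the widening and accumulation happen inside the dot product itself, so no separate widening step is needed between the absolute difference and the accumulate.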

Performance improvement against Neon ranges from about 5% (sad_16x8, sad_16x16) to about 88% (ssd_16x16, i.e. 1.88x the Neon speed).

The following is the output of ./checkasm8 --bench, run on a Graviton4 system (lower is better):

sad_16x8_c: 1324
sad_16x8_neon: 222
sad_16x8_dotprod: 211
sad_16x16_c: 2535
sad_16x16_neon: 344
sad_16x16_dotprod: 325
sad_x3_16x8_c: 3837
sad_x3_16x8_neon: 415
sad_x3_16x8_dotprod: 329
sad_x3_16x16_c: 7724
sad_x3_16x16_neon: 722
sad_x3_16x16_dotprod: 546
sad_x4_16x8_c: 5080
sad_x4_16x8_neon: 438
sad_x4_16x8_dotprod: 377
sad_x4_16x16_c: 10263
sad_x4_16x16_neon: 784
sad_x4_16x16_dotprod: 652
ssd_8x4_c: 381
ssd_8x4_neon: 163
ssd_8x4_dotprod: 133
ssd_8x4_sve: 150
ssd_8x8_c: 695
ssd_8x8_neon: 237
ssd_8x8_dotprod: 158
ssd_8x8_sve: 228
ssd_8x16_c: 1335
ssd_8x16_neon: 387
ssd_8x16_dotprod: 260
ssd_16x8_c: 1342
ssd_16x8_neon: 285
ssd_16x8_dotprod: 167
ssd_16x16_c: 2622
ssd_16x16_neon: 503
ssd_16x16_dotprod: 267
vsad_c: 2782
vsad_neon: 287
vsad_dotprod: 229
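
The relative gain of each dotprod routine over its Neon counterpart can be derived from the figures above. The helper below is a minimal sketch of that arithmetic, assuming the improvement is defined as the Neon timing divided by the dotprod timing, minus one; improvement_percent is a hypothetical name, not part of x264 or checkasm.

```c
#include <assert.h>

/* Percentage by which the dotprod version is faster than the Neon version,
 * given two checkasm timings where lower is better:
 *     (neon / dotprod - 1) * 100 */
double improvement_percent(double neon, double dotprod)
{
    assert(dotprod > 0.0);
    return (neon / dotprod - 1.0) * 100.0;
}
```

For example, improvement_percent(222, 211) for sad_16x8 comes to roughly 5%, and improvement_percent(503, 267) for ssd_16x16 to roughly 88%, a ratio of about 1.88x.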

The dotprod ssd functions are also faster than their _sve counterparts (ssd_8x4 and ssd_8x8), which raises the question of how to choose between implementations when both are available (e.g. on a Graviton3/4 system).
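
One possible answer, sketched below, is to check the CPU capabilities from least to most preferred so that the last match wins. This is only an illustration of the idea, not something this merge request states: the flag names, the impl_t type and pick_ssd_8x8 are all hypothetical, and x264's real flag names, values and initialisation code may differ.

```c
#include <stdint.h>

/* Hypothetical capability bits; x264's real flag names and values may differ. */
enum { CPU_NEON = 1 << 0, CPU_SVE = 1 << 1, CPU_DOTPROD = 1 << 2 };

typedef enum { IMPL_C, IMPL_NEON, IMPL_SVE, IMPL_DOTPROD } impl_t;

/* Checks run from least to most preferred, so the last matching check wins.
 * Placing DotProd after SVE encodes the ssd_8x8 timings in the benchmark
 * (158 for dotprod against 228 for sve). */
impl_t pick_ssd_8x8(uint32_t cpu)
{
    impl_t impl = IMPL_C;

    if (cpu & CPU_NEON)
        impl = IMPL_NEON;
    if (cpu & CPU_SVE)
        impl = IMPL_SVE;
    if (cpu & CPU_DOTPROD)
        impl = IMPL_DOTPROD;
    return impl;
}
```

A fixed ordering like this is simple, but it bakes in the results of one machine; whether the same ordering holds on other cores that support both extensions would need to be measured.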

Edited Mar 06, 2025 by Konstantinos Margaritis
