riscv64/mc: Add bidir functions
This code strikes a compromise between the performance of a dedicated kernel per VLEN/width pair and the flexibility of a fully VLEN-dynamic loop: w=4 gets a special case, and the remaining widths are subdivided into the w8/16/32 fast paths (unrolled to four lines per iteration) and the w64/128 slow paths (one line per iteration). For vl256, the fast path also covers w64.
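As a rough illustration of the dispatch strategy described above, here is a scalar C sketch of the avg case. The per-pixel arithmetic matches the 8bpc C reference (sh = 5, rnd = 16 for 4 intermediate bits); the function names, the exact unroll structure, and the path boundaries are illustrative assumptions, not the actual assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the 8bpc bidir average: dst = clip((a + b + rnd) >> sh).
 * dav1d's 8bpc prep output carries 4 intermediate bits, so sh = 5, rnd = 16. */
static uint8_t avg_px(int16_t a, int16_t b)
{
    const int v = (a + b + 16) >> 5;
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

static void avg_row(uint8_t *dst, const int16_t *t1, const int16_t *t2, int w)
{
    for (int x = 0; x < w; x++)
        dst[x] = avg_px(t1[x], t2[x]);
}

/* Dispatch mirroring the strategy above (illustrative only): w=4 is
 * special-cased, w8/16/32 take the fast path (four rows per iteration),
 * w64/128 the slow path (one row per iteration). */
static void avg_8bpc(uint8_t *dst, ptrdiff_t stride,
                     const int16_t *t1, const int16_t *t2, int w, int h)
{
    if (w == 4) {                                  /* special case */
        for (int y = 0; y < h; y++, dst += stride, t1 += 4, t2 += 4)
            avg_row(dst, t1, t2, 4);
    } else if (w <= 32) {                          /* fast path: 4 rows/iter */
        for (int y = 0; y < h; y += 4)
            for (int i = 0; i < 4 && y + i < h;
                 i++, dst += stride, t1 += w, t2 += w)
                avg_row(dst, t1, t2, w);
    } else {                                       /* slow path: 1 row/iter */
        for (int y = 0; y < h; y++, dst += stride, t1 += w, t2 += w)
            avg_row(dst, t1, t2, w);
    }
}
```

In the vector version, the fast path can afford the four-row unroll because the whole row fits in the register group; the slow path strip-mines each row instead.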
Kendryte K230
avg_w4_8bpc_c: 346.7 ( 1.00x)
avg_w4_8bpc_rvv: 56.3 ( 6.16x)
avg_w8_8bpc_c: 1054.7 ( 1.00x)
avg_w8_8bpc_rvv: 139.0 ( 7.59x)
avg_w16_8bpc_c: 3398.6 ( 1.00x)
avg_w16_8bpc_rvv: 350.1 ( 9.71x)
avg_w32_8bpc_c: 13726.7 ( 1.00x)
avg_w32_8bpc_rvv: 1246.9 (11.01x)
avg_w64_8bpc_c: 33217.0 ( 1.00x)
avg_w64_8bpc_rvv: 3789.6 ( 8.77x)
avg_w128_8bpc_c: 83483.2 ( 1.00x)
avg_w128_8bpc_rvv: 9786.0 ( 8.53x)
w_avg_w4_8bpc_c: 441.6 ( 1.00x)
w_avg_w4_8bpc_rvv: 72.5 ( 6.10x)
w_avg_w8_8bpc_c: 1364.8 ( 1.00x)
w_avg_w8_8bpc_rvv: 200.3 ( 6.81x)
w_avg_w16_8bpc_c: 4417.6 ( 1.00x)
w_avg_w16_8bpc_rvv: 562.7 ( 7.85x)
w_avg_w32_8bpc_c: 17890.7 ( 1.00x)
w_avg_w32_8bpc_rvv: 2093.5 ( 8.55x)
w_avg_w64_8bpc_c: 43231.4 ( 1.00x)
w_avg_w64_8bpc_rvv: 5739.2 ( 7.53x)
w_avg_w128_8bpc_c: 107984.3 ( 1.00x)
w_avg_w128_8bpc_rvv: 14283.1 ( 7.56x)
mask_w4_8bpc_c: 497.5 ( 1.00x)
mask_w4_8bpc_rvv: 92.8 ( 5.36x)
mask_w8_8bpc_c: 1529.7 ( 1.00x)
mask_w8_8bpc_rvv: 253.4 ( 6.04x)
mask_w16_8bpc_c: 4964.3 ( 1.00x)
mask_w16_8bpc_rvv: 680.4 ( 7.30x)
mask_w32_8bpc_c: 20288.7 ( 1.00x)
mask_w32_8bpc_rvv: 3015.6 ( 6.73x)
mask_w64_8bpc_c: 49726.6 ( 1.00x)
mask_w64_8bpc_rvv: 7210.9 ( 6.90x)
mask_w128_8bpc_c: 126687.1 ( 1.00x)
mask_w128_8bpc_rvv: 18279.9 ( 6.93x)
Edited by Niklas Haas