Buffer CDEF input in 8-block-wide chunks
Also, implement new CDEF for AVX2. The AVX2 implementation offsets the input pixels by 128 and interleaves the chroma planes.
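
A scalar sketch of the input layout described above, for illustration only. The real code is AVX2, and the helper names `offset_pixel` and `interleave_offset_row` are hypothetical. It assumes "offset by 128" means mapping 8-bit pixels into the signed byte range, and that interleaving means storing the two chroma rows as U0 V0 U1 V1 ...

```rust
/// Hypothetical helper: map an 8-bit pixel to a signed byte by subtracting 128.
/// Flipping the top bit is the same as subtracting 128 modulo 256.
fn offset_pixel(p: u8) -> i8 {
    (p ^ 0x80) as i8
}

/// Hypothetical helper: interleave one row from each chroma plane as
/// U0 V0 U1 V1 ..., offsetting every sample by 128 on the way in.
/// Assumed layout; the actual buffer format is not given in this description.
fn interleave_offset_row(u: &[u8], v: &[u8]) -> Vec<i8> {
    assert_eq!(u.len(), v.len(), "chroma rows must have equal width");
    u.iter()
        .zip(v)
        .flat_map(|(&a, &b)| [offset_pixel(a), offset_pixel(b)])
        .collect()
}
```

With `u = [0, 128, 255]` and `v = [64, 127, 200]`, `interleave_offset_row` returns `[-128, -64, 0, -1, 127, 72]`. With this layout a single pass over the buffer touches both chroma planes.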
Profiling results from Zen 2:
NEW:
(cdef_filter_*: the uv variants filter both chroma planes at once)
cdef_filter_uv_4x4_01_8bpc_avx2: 102.2
cdef_filter_uv_4x4_10_8bpc_avx2: 79.7
cdef_filter_uv_4x4_11_8bpc_avx2: 199.6
cdef_filter_uv_4x8_01_8bpc_avx2: 171.9
cdef_filter_uv_4x8_10_8bpc_avx2: 128.3
cdef_filter_uv_4x8_11_8bpc_avx2: 251.6
cdef_filter_uv_8x8_01_8bpc_avx2: 294.7
cdef_filter_uv_8x8_10_8bpc_avx2: 240.6
cdef_filter_uv_8x8_11_8bpc_avx2: 436.2
cdef_filter_y_01_8bpc_avx2: 188.9
cdef_filter_y_10_8bpc_avx2: 112.8
cdef_filter_y_11_8bpc_avx2: 241.6
(Prepares the input buffer for cdef)
cdef_prep_uv_4x4_8bpc_avx2: 60.0
cdef_prep_uv_4x8_8bpc_avx2: 90.5
cdef_prep_uv_8x8_8bpc_avx2: 126.0
cdef_prep_y_8bpc_avx2: 81.6
Runtime from a WebRTC sample clip (about 10% spent in CDEF):
Time (mean ± σ): 881.6 ms ± 6.6 ms [User: 873.3 ms, System: 6.7 ms]
Range (min … max): 874.7 ms … 898.2 ms 20 runs
OLD:
cdef_filter_4x4_01_8bpc_avx2: 86.8
cdef_filter_4x4_10_8bpc_avx2: 80.3
cdef_filter_4x4_11_8bpc_avx2: 116.5
cdef_filter_4x8_01_8bpc_avx2: 122.8
cdef_filter_4x8_10_8bpc_avx2: 90.2
cdef_filter_4x8_11_8bpc_avx2: 176.2
cdef_filter_8x8_01_8bpc_avx2: 171.2
cdef_filter_8x8_10_8bpc_avx2: 122.2
cdef_filter_8x8_11_8bpc_avx2: 247.7
Time (mean ± σ): 909.1 ms ± 4.8 ms [User: 900.6 ms, System: 6.1 ms]
Range (min … max): 902.5 ms … 921.7 ms 20 runs
Overall, the NEW mean runtime is 27.5 ms lower than OLD (881.6 ms vs 909.1 ms), a reduction of about 3.0%.
Edited by Kyle Siefring