Optimize the cdef_filter C implementation

Performance numbers, measured on Skylake-X:

                          Before    After
cdef_filter_4x4_8bpc_c:   1217.0    885.2
cdef_filter_4x8_8bpc_c:   2355.1   1710.1
cdef_filter_8x8_8bpc_c:   2669.5   1439.7

For 10-bit content (which currently uses the C DSP code), overall decoding performance improves by around 20%.
The asm can also be optimized using the same approach, although the benefit will likely be a bit smaller.