Rework the usage of noskip_mask
Remove half of the masks, since they are only used for cdef at an 8x8 level of granularity.
Load the mask and combine the 16-bit sections into the 32-bit sections outside of the inner cdef loop. This should save some registers.
This results in a mild performance improvement.
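
A minimal sketch of the idea of combining the 16-bit sections ahead of the inner loop. The layout here (two 16-bit halves per row, one bit per 8x8 block) and the helper names combine_noskip_mask and count_noskip_8x8 are assumptions for illustration only, not the actual code touched by this change.

```c
#include <stdint.h>

/* Assumed layout: one mask row is stored as two 16-bit halves,
 * with one bit per 8x8 block (bit set = block is not skipped). */
static inline uint32_t combine_noskip_mask(const uint16_t halves[2]) {
    return (uint32_t)halves[0] | ((uint32_t)halves[1] << 16);
}

/* Hypothetical consumer: counts the non-skipped 8x8 blocks in a row.
 * The mask is loaded and combined once, before the loop, so the loop
 * body only shifts and tests a single 32-bit value.
 * n_blocks must be in the range 0..32. */
static int count_noskip_8x8(const uint16_t halves[2], int n_blocks) {
    const uint32_t mask = combine_noskip_mask(halves);
    int n = 0;
    for (int bx = 0; bx < n_blocks; bx++)
        n += (int)((mask >> bx) & 1u);
    return n;
}
```

Hoisting the load and combine out of the loop means only one 32-bit value has to stay live inside it, which is the kind of register saving described above.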