thanks :) the braiding approach is super clever too. this was one of those weird moments where you find something and then have to triple-check your results, because how could i accidentally find something better than an algorithm that hasn't been touched in decades...
the part i really like is that it gives us a small improvement on the pclmul version too, since the non-accelerated algorithm doesn't really stand a chance against the accelerated opcode on newer hardware and so probably isn't going to see much use in practice. however... i think hardware solutions (e.g. ethernet cards) could possibly benefit
I was looking at the zlib-ng crc32 implementation, which is where I saw that it was recently updated to include your algorithm.
Good work, it's a surprisingly elegant solution when compared to the braiding approaches!