I can't vouch for whether it can beat human experts, though, because I'm no CUDA expert myself. The original CUDA code was human-written, and I first had codex adapt it to my specific use case. Then I basically let codex generate ideas and try them out itself (I think it's a bit like Karpathy's autoresearch, except I was still prompting manually). And that was enough to get me a 20x improvement.
I suspect that when people said AI wrote non-performant CUDA kernels, it was early to mid last year, and it has definitely improved vastly since then. And the agent's ability to iteratively improve really impressed me.