Ahh ok. I’ve had a conversation with a graphics guy that closely paralleled this one, and I was confused then, too. It seems the asynchronous compute issues are only a limitation when it comes to shaders, I guess? I write some CUDA but no shaders, and certainly not enough that I’d make super concrete claims about more esoteric features like streams.
Modern NVIDIA GPUs can run up to 128 kernels concurrently, though the exact limit depends on the compute capability.
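For what it's worth, here is a rough sketch of what I mean by streams on the CUDA side. It launches one small kernel per stream, so the hardware is free to overlap them; whether they actually overlap depends on the device and on how many resources each kernel takes up. The names `add_one` and `run_on_streams` are made up for illustration, and error handling is kept to a minimum.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Trivial kernel: each thread adds 1.0f to one element.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: launches one small kernel per stream, then checks
// that every element of every buffer ended up as 1.0f.
bool run_on_streams(int num_streams, int n) {
    std::vector<cudaStream_t> streams(num_streams);
    std::vector<float *> bufs(num_streams, nullptr);
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    for (int s = 0; s < num_streams; ++s) {
        if (cudaStreamCreate(&streams[s]) != cudaSuccess) return false;
        if (cudaMalloc(&bufs[s], bytes) != cudaSuccess) return false;
        // Work queued on the same stream runs in order; work queued on
        // different streams is allowed (not guaranteed) to overlap.
        cudaMemsetAsync(bufs[s], 0, bytes, streams[s]);
        add_one<<<blocks, threads, 0, streams[s]>>>(bufs[s], n);
    }

    // Wait for all streams to finish before reading anything back.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<float> host(n);
    for (int s = 0; s < num_streams; ++s) {
        if (cudaMemcpy(host.data(), bufs[s], bytes,
                       cudaMemcpyDeviceToHost) != cudaSuccess) {
            ok = false;
        }
        for (int i = 0; i < n && ok; ++i) {
            if (host[i] != 1.0f) ok = false;
        }
        cudaFree(bufs[s]);
        cudaStreamDestroy(streams[s]);
    }
    return ok;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    // Nonzero if the device can execute multiple kernels at the same time.
    std::printf("concurrentKernels = %d\n", prop.concurrentKernels);

    bool ok = run_on_streams(4, 1 << 10);
    std::printf("results %s\n", ok ? "ok" : "wrong");
    return ok ? 0 : 1;
}
```

The kernels are deliberately tiny here: a single kernel that fills the whole GPU leaves no room for the others, so you would see them run back to back even with separate streams. A profiler timeline is the easiest way to check whether any overlap really happened.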