Advice on porting an HPC application to GPU

I’m not certain of the details, but my understanding is that there’s no limit on the total number of kernels that can be launched, only on the number of kernels waiting in the launch queue. Once the queue fills up, the next kernel launch needs to block until a spot opens up.
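To make the blocking behavior concrete, here is a toy Python model of a bounded launch queue. It's only a sketch of the idea, not the CUDA runtime or driver: the function name `simulate_launches`, the queue depth, and the timings are all made up for illustration, and I'm assuming a queue slot frees up only when a kernel finishes.

```python
def simulate_launches(num_kernels, queue_depth, launch_cost, kernel_time):
    """Toy model of asynchronous kernel launches through a bounded queue.

    Hypothetical simplification: the host pays `launch_cost` per launch,
    the device runs kernels one at a time for `kernel_time` each, and at
    most `queue_depth` launched-but-unfinished kernels may exist at once.
    Returns (return_times, stalls): the host time at which each launch
    call returns, and whether that launch had to wait for a free slot.
    """
    finish_times = []  # device completion time of each kernel
    return_times = []  # host clock when each launch call returns
    stalls = []        # True if the launch blocked on a full queue
    host_clock = 0.0

    for i in range(num_kernels):
        # A slot for launch i is free once kernel (i - queue_depth) is done.
        slot_free = finish_times[i - queue_depth] if i >= queue_depth else 0.0
        stalls.append(slot_free > host_clock)

        # The host waits for the slot (if needed), then pays the launch cost.
        host_clock = max(host_clock, slot_free) + launch_cost
        return_times.append(host_clock)

        # The device starts this kernel after the previous one finishes.
        previous_done = finish_times[-1] if finish_times else 0.0
        finish_times.append(max(host_clock, previous_done) + kernel_time)

    return return_times, stalls


if __name__ == "__main__":
    times, stalls = simulate_launches(
        num_kernels=6, queue_depth=2, launch_cost=1.0, kernel_time=10.0
    )
    for i, (t, s) in enumerate(zip(times, stalls)):
        print(f"launch {i}: returns at t={t}, blocked={s}")
```

In this model the first `queue_depth` launches return almost immediately, and after that each launch returns only about as fast as the device finishes kernels. That's the pattern to look for if the host side of your port seems to stall on launches.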

I found this post which you might find helpful: