Sorry, couldn’t find my answer anywhere. For concurrent kernel execution, is it that multiple kernels run at the exact same time on one multiprocessor, or that once one multiprocessor is freed, another kernel can launch on it?
For example, if you have two kernels, both with a global read, ALU work, and a global write, is it possible that, on one multiprocessor, the GPU is executing a global read for kernel 2 while, at the exact same time, executing ALU operations from kernel 1? Or is it that once one multiprocessor is done executing kernel 1 it can immediately begin executing kernel 2, even if other multiprocessors are still running code from kernel 1? There is a big difference here.
The kernels run simultaneously on different multiprocessors, assuming there are free resources.
For example, if you launch one kernel with 5 blocks and another with 10 blocks, they should both be able to execute truly simultaneously. On the other hand, if you launch one kernel with 2500 blocks and another with 1000 blocks, the CUDA block scheduler should be intelligent enough to start the 2nd kernel while the 1st kernel is winding down.
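As a minimal sketch of the first case: for two kernels to run concurrently at all, they must be launched into different non-default streams on a device that supports concurrent kernels (kernels in the same stream always serialize). The kernel names and sizes here are illustrative, matching the 5-block / 10-block example above; this assumes a GPU with enough free multiprocessors to hold both grids at once.

```
#include <cuda_runtime.h>

// Two trivial illustrative kernels; the actual work doesn't matter here.
__global__ void kernelA(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}
__global__ void kernelB(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 5 * 256;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Separate non-default streams: required for concurrency.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Small grids (5 and 10 blocks) can fit on the device together,
    // so the two kernels may execute truly simultaneously on
    // different multiprocessors.
    kernelA<<<5, 256, 0, s1>>>(a, n);
    kernelB<<<10, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If the same two launches went into the default stream (or the same stream), they would run back-to-back regardless of how many multiprocessors were idle. Whether concurrency actually occurs can be checked with a profiler such as Nsight Systems.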