I’m a bit confused.
Kernel launches are asynchronous, right? So after a kernel has been launched, CPU code runs without waiting for the kernel to finish.
But kernels can’t run in parallel… So if I have the following code:
kernelOne<<<1,1>>>();
kernelTwo<<<1,1>>>();
printf("Kernels are done!\n");
Will the kernelTwo launch block the CPU? Or will the printf run right away while, on the GPU, kernelTwo waits for kernelOne to finish?
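
One way to check this empirically is a small test program. The sketch below is only an illustration and makes some assumptions: both kernels go to the default stream, the kernel bodies are hypothetical stand-ins (the original kernels are not shown), and the spin length of 500000000 cycles is an arbitrary value chosen to make kernelOne noticeably slow.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for real work: spin for roughly `cycles` GPU clock ticks.
__global__ void kernelOne(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy-wait */ }
    printf("GPU: kernelOne finished\n");
}

// Hypothetical second kernel: reports when it runs.
__global__ void kernelTwo() {
    printf("GPU: kernelTwo finished\n");
}

int main() {
    // Both launches go to the default stream (assumption of this sketch).
    kernelOne<<<1, 1>>>(500000000LL);
    kernelTwo<<<1, 1>>>();

    // Printed by the CPU as soon as both launches have been issued.
    printf("CPU: both kernels launched\n");

    // Block the CPU until all previously issued GPU work has completed.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CPU: kernels are done\n");
    return 0;
}
```

If launches are asynchronous and work in one stream runs in issue order, the "CPU: both kernels launched" line should show up first, followed by the kernelOne and kernelTwo messages in that order, with "CPU: kernels are done" only after cudaDeviceSynchronize returns. Without that synchronize call, a printf placed right after the launches says nothing about whether the kernels have actually finished.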