Asynchronous kernel calls

I am running some CUDA code.

The kernels are called here:

dim3 Layer1_Block(6,1);
dim3 Layer1_Thread(13,13);

dim3 Layer2_Block(50,1);
dim3 Layer2_Thread(5,5);
executeSecondLayer<<<Layer2_Block,Layer2_Thread>>>(Layer2_Neurons_GPU, Layer2_Weights_GPU, Layer3_Neurons_GPU);

dim3 Layer3_Block(100,1);
dim3 Layer3_Thread(1,1);
executeThirdLayer<<<Layer3_Block,Layer3_Thread>>>(Layer3_Neurons_GPU, Layer3_Weights_GPU, Layer4_Neurons_GPU);

dim3 Layer4_Block(10,1);
dim3 Layer4_Thread(1,1);


As you can see, results from the N-th kernel call are used in the (N+1)-th kernel call. There is no cudaThreadSynchronize call between the kernel launches, yet everything always works correctly.

Why? Are kernel calls synchronous? Or is something else going on?

All kernel calls are synchronous with respect to the GPU. What happens in that code snippet is that the first kernel is launched, and then the second and third are queued by the driver. The CPU is free to run asynchronously, but the GPU only runs a single kernel at any given time, and kernels issued into the same stream execute in launch order.
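A minimal sketch of this behavior (the kernels `square` and `addOne` are hypothetical, not from the original code): two dependent kernels are launched back-to-back into the default stream with no synchronization call between them, and the second still sees the first's results because the stream executes them in order.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two trivially dependent kernels: addOne must run after square.
__global__ void square(float *data) { data[threadIdx.x] *= data[threadIdx.x]; }
__global__ void addOne(float *data) { data[threadIdx.x] += 1.0f; }

int main() {
    float h[4] = {1, 2, 3, 4};
    float *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    square<<<1, 4>>>(d);   // queued; host thread returns immediately
    addOne<<<1, 4>>>(d);   // queued behind square, so it sees square's results

    // Synchronization is only needed before the host reads the data, and a
    // blocking cudaMemcpy on the default stream implies it here anyway.
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%g %g %g %g\n", h[0], h[1], h[2], h[3]);  // 2 5 10 17
    cudaFree(d);
    return 0;
}
```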

The kernel launch implicitly calls cudaThreadSynchronize, so a kernel will not start until the previous one has ended.


It does no such thing. That would imply the host thread owning the context sits in a spinlock until the kernel call finishes, which doesn't happen. The driver maintains a queue: the kernel launch is queued and the host thread is released to run asynchronously. There is evidence that if the driver queue fills, the host thread will be held until a slot in the queue becomes free, but it seems you need to have queued a lot of kernel launches (it might be as many as 64 in CUDA 2.3) before that happens.
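One way to see that the launch itself does not block is to time it (a sketch, assuming a hypothetical busy-wait kernel `spin`; the exact durations depend on the GPU's clock rate): the launch call returns to the host almost immediately, while an explicit cudaDeviceSynchronize blocks until the queued kernel has actually finished.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Busy-wait on the GPU for roughly the given number of clock cycles.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    spin<<<1, 1>>>(200000000LL);   // queued; control returns at once
    auto t1 = clk::now();
    cudaDeviceSynchronize();       // host now blocks until the kernel ends
    auto t2 = clk::now();

    // Typically the first duration is microseconds, the second much longer.
    printf("launch returned after %lld us, kernel done after %lld ms\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
           (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t0).count());
    return 0;
}
```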

Thanks, now I understand!