As you can see, results from the N-th kernel call are used in the (N+1)-th kernel call. There is no cudaThreadSynchronize() call between the kernel calls, but everything always works correctly.
Why? Are small kernel calls synchronous? Or is something else going on?
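The pattern being asked about is presumably something like the following minimal sketch (the kernel name, launch configuration, and buffer sizes here are hypothetical, not the questioner's actual code): each launch reads what the previous launch wrote, with no explicit synchronization between them.

```
__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;   // reads the result written by the previous launch
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // No cudaThreadSynchronize() between launches, yet each launch
    // observes the previous launch's results.
    step<<<(n + 255) / 256, 256>>>(d_data, n);
    step<<<(n + 255) / 256, 256>>>(d_data, n);
    step<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaThreadSynchronize();   // only needed before reading back on the host
    cudaFree(d_data);
    return 0;
}
```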
All kernel calls are synchronous with respect to the GPU. What happens in that code snippet is that the first kernel is launched, and the second and third are queued by the driver. The CPU is free to run asynchronously, but the GPU only runs a single kernel at any given time, and kernels issued to the same stream execute in the order they were launched.
It does no such thing. That would imply the host thread owning the context sits in a spinlock until the kernel call finishes, which doesn't happen. The driver maintains a queue: the kernel launch is queued and the host thread is released to run asynchronously. There is evidence that if the driver queue fills, the host thread will be held until a slot on the queue becomes free, but it seems you need to have queued a lot of kernel launches (it might be as many as 64 in CUDA 2.3) before that happens.
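You can convince yourself of this with a simple timing experiment. This is only a sketch (the busy-loop kernel and tick counts are illustrative, and actual numbers depend on your hardware): the launch itself returns to the host almost immediately, while the explicit synchronize call is where the host actually waits.

```
#include <cstdio>
#include <ctime>

// Deliberately slow kernel so the launch/sync timing gap is visible.
__global__ void busy(float *out)
{
    float x = 0.0f;
    for (int i = 0; i < 100000000; ++i)
        x += 1e-8f;
    *out = x;
}

int main(void)
{
    float *d_out;
    cudaMalloc(&d_out, sizeof(float));

    clock_t t0 = clock();
    busy<<<1, 1>>>(d_out);       // queued by the driver; host is released
    clock_t t1 = clock();        // reached almost immediately after launch
    cudaThreadSynchronize();     // host blocks here until the kernel finishes
    clock_t t2 = clock();

    printf("launch took %ld ticks, sync took %ld ticks\n",
           (long)(t1 - t0), (long)(t2 - t1));
    cudaFree(d_out);
    return 0;
}
```

If the launch were synchronous for the host, the first interval would be as large as the second; in practice it is tiny.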