Synchronization between Kernel calls

Is the invocation of a kernel synchronous?

For example, I have this program:

kernel_1<<<...>>>(...);

for (i = 0; i < k; i++) {
    kernel_2<<<...>>>(...);
    kernel_3<<<...>>>(...);
}

Each kernel must start only after all threads of the previous kernel have finished.

Some threads may take more time than others.

I tried calling cudaDeviceSynchronize() after all the kernel calls, but it didn’t help.

I’m getting random, incorrect results in my calculation.

Kernels launched into the same stream execute sequentially, in launch order, so you don’t need any extra synchronization in this example. If your kernels return unexpected results, there must be some other cause.
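To help track down that other cause, here is a minimal sketch of the same launch pattern with error checking added. The kernels (fill, add_one, times_two), the CUDA_CHECK macro and run_pipeline are made up for illustration and are not the original poster's code. Each kernel depends on the result of the previous one, and everything goes into the default stream with no synchronization between launches.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical kernels: each one reads what the previous one wrote.
__global__ void fill(int *d, int n, int v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = v;
}
__global__ void add_one(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;
}
__global__ void times_two(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2;
}

// Runs fill, then k rounds of add_one and times_two, all in the default
// stream, and returns the final value of the last element.
int run_pipeline(int n, int k) {
    int *d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(int)));
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    fill<<<blocks, threads>>>(d, n, 1);
    CUDA_CHECK(cudaGetLastError());  // reports launch errors (e.g. bad configuration)
    for (int i = 0; i < k; i++) {
        add_one<<<blocks, threads>>>(d, n);
        CUDA_CHECK(cudaGetLastError());
        times_two<<<blocks, threads>>>(d, n);
        CUDA_CHECK(cudaGetLastError());
    }
    // Blocks the host until every queued kernel has finished and
    // reports errors that occurred while the kernels were running.
    CUDA_CHECK(cudaDeviceSynchronize());

    int last = 0;
    CUDA_CHECK(cudaMemcpy(&last, d + (n - 1), sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d));
    return last;
}

int main() {
    // 1 -> (1+1)*2 = 4 -> (4+1)*2 = 10 -> (10+1)*2 = 22
    printf("%d\n", run_pipeline(1000, 3));  // prints 22
    return 0;
}
```

If a check like this reports nothing and the results are still wrong, the problem is more likely inside a kernel (for example a race between threads of the same kernel, or an out-of-bounds access) than in the ordering of the launches.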

While kernel launches are asynchronous, they are asynchronous only with respect to the CPU. As tera said, kernels in the same stream execute in order, in FIFO fashion.
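A small sketch of what "asynchronous with respect to the CPU" means in practice, assuming an explicitly created stream and pinned host memory. The kernel square and the function sum_of_squares_on_stream are hypothetical names, and error checking is left out to keep it short. The three operations are queued in one stream and run in FIFO order on the device, but the host has to wait on the stream before it reads the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: squares each element in place.
__global__ void square(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= d[i];
}

// Computes 1^2 + 2^2 + ... + n^2 on the GPU using a single stream.
int sum_of_squares_on_stream(int n) {
    cudaStream_t s;
    cudaStreamCreate(&s);

    int *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, n * sizeof(int));  // pinned memory, so the copies can be asynchronous
    cudaMalloc(&d, n * sizeof(int));
    for (int i = 0; i < n; i++) h[i] = i + 1;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // All three operations are queued in stream s. The device runs them in
    // this order, but each call may return to the host before its work is done.
    cudaMemcpyAsync(d, h, n * sizeof(int), cudaMemcpyHostToDevice, s);
    square<<<blocks, threads, 0, s>>>(d, n);
    cudaMemcpyAsync(h, d, n * sizeof(int), cudaMemcpyDeviceToHost, s);

    // The host must wait here; reading h before this point could see stale data.
    cudaStreamSynchronize(s);

    int sum = 0;
    for (int i = 0; i < n; i++) sum += h[i];

    cudaFree(d);
    cudaFreeHost(h);
    cudaStreamDestroy(s);
    return sum;
}

int main() {
    printf("%d\n", sum_of_squares_on_stream(4));  // prints 30 (1 + 4 + 9 + 16)
    return 0;
}
```

Note that no synchronization is needed between the copy in, the kernel, and the copy out. The only wait is the one that protects the host-side read.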