Thread synchronization across blocks


I understand that __syncthreads() synchronizes all the threads in a block.

But I want all the threads in all the blocks to be synchronized.
How do I do it?


The only synchronization between all threads is finishing a kernel and starting the next kernel. Inside a kernel, __syncthreads() only reaches the threads of a single block, so it cannot be used to synchronize threads across blocks.
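To make the kernel-boundary idea concrete, here is a minimal sketch. The kernel names `fill` and `reverse_copy` and the host helper `run_two_phase` are made up for illustration, and it assumes both launches go to the default stream, where kernels run in launch order. Phase 2 reads values that phase 1 wrote from other blocks, which is only safe because phase 1 has completely finished before phase 2 starts.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Phase 1: every thread writes one element.
__global__ void fill(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Phase 2: every thread reads an element that was usually written
// by a thread in a different block during phase 1.
__global__ void reverse_copy(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[n - 1 - i];
}

// Hypothetical host helper: runs both phases and checks the result.
// Returns true when every element of the output is as expected.
bool run_two_phase(int n) {
    int *a = nullptr, *b = nullptr;
    if (cudaMalloc(&a, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&b, n * sizeof(int)) != cudaSuccess) { cudaFree(a); return false; }

    int threads = 144;
    int blocks = (n + threads - 1) / threads;

    // Two launches in the same (default) stream execute in order, so the
    // boundary between them acts as a barrier across all blocks.
    fill<<<blocks, threads>>>(a, n);
    reverse_copy<<<blocks, threads>>>(a, b, n);

    // cudaMemcpy on the default stream waits for the preceding kernels.
    std::vector<int> host(n);
    cudaError_t err = cudaMemcpy(host.data(), b, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != n - 1 - i) return false;
    return true;
}
```

Calling `run_two_phase(1000)` from a host program should return true on a working device. Note that the host does not need to wait between the two launches: both are queued asynchronously and the ordering is enforced on the device side.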

Hello there,

I agree that the only synchronization between blocks is launching the kernel again, but what about the cost of a kernel launch?
I mean, I guess that whenever a kernel finishes, control returns to the CPU, which then has to launch the kernel again to get back onto the GPU. So I assume that is quite expensive in terms of performance, isn't it? Does anyone know the cost of a kernel launch?


It depends on the size of the grid being launched, but usually it is ~10-20 microseconds.

Thanks for your answer.
I checked by writing an empty kernel and launching it several times.

The first launch takes 31 microseconds,
and the subsequent launches take around 7 microseconds each.
These values were obtained with 1 million blocks of 144 threads each.

Of course, the times change depending on the grid size, but the first kernel launch is always the slowest.

Why do you think that is happening?

To my understanding, the first kernel call involves one-time work such as compiling the PTX into native code. Subsequent calls of the same kernel do not need this, so they are faster. I don’t know how many kernels can be “cached” this way, though.
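For anyone who wants to repeat the measurement discussed above, the following is a rough sketch rather than the original poster's code. The names `empty_kernel` and `time_launches` are hypothetical. It assumes that wall-clock timing around a launch plus `cudaDeviceSynchronize()` is an acceptable proxy for launch cost, and it calls `cudaFree(0)` first, a common idiom to force context creation so that this setup is not counted in the first launch.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <vector>

// Kernel with no body, so the measured time is launch and scheduling
// overhead only.
__global__ void empty_kernel() {}

// Hypothetical helper: launches the empty kernel `repeats` times and
// returns the wall-clock time of each launch in microseconds.
std::vector<double> time_launches(int blocks, int threads, int repeats) {
    // Force CUDA context creation before timing starts.
    cudaFree(0);

    std::vector<double> us;
    us.reserve(repeats);
    for (int r = 0; r < repeats; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        empty_kernel<<<blocks, threads>>>();
        // Launches are asynchronous, so wait for the kernel to finish
        // before reading the clock again.
        cudaDeviceSynchronize();
        auto t1 = std::chrono::steady_clock::now();
        us.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    return us;
}
```

Comparing element 0 of the returned vector with the later elements shows whether the first launch still carries extra one-time cost on a given system. The absolute numbers will depend on the GPU, driver, and grid size, so they should not be expected to match the figures quoted in this thread.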