Synchronize Blocks Within a CUDA Kernel: Your Opinion

Will NVIDIA be implementing a new method to synchronize all the blocks executing a kernel? Is it even possible with current hardware?

No and no.

At any instant, each block is in one of the following three states:

  1. done, or

  2. executing on some SM, or

  3. waiting because no resources are available.

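You cannot synchronize ALL blocks. More precisely, you cannot synchronize blocks in state 3: such a block cannot start until a running block finishes and frees its resources, so a barrier that waits for it would deadlock.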

But you can synchronize blocks in states 1 and 2.

I have an iterative algorithm that reads and writes global memory, and I need to sync the blocks between iterations. What would be the best way to do this? Maybe __threadfence()?

Currently, I call the kernel from the host in each iteration:

for (…)
    Cuda-kernel;

Thanks in advance.

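Your method is the best method. You don’t want to use __threadfence(): it only orders the calling thread's memory writes as seen by other threads. It is not a barrier and does not make blocks wait for one another.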
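For illustration, here is a minimal sketch of the launch-per-iteration approach. The kernel `step`, the host helper `run_iterations`, and the neighbour-averaging update are hypothetical stand-ins, not the original poster's algorithm. The point is only that kernels launched into the same stream run in order, so each launch sees every global-memory write made by the previous one. Error checking is omitted for brevity and a non-empty input is assumed.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Hypothetical per-iteration step: each element becomes the average of its
// two ring neighbours from the previous iteration. Reads `in`, writes `out`.
__global__ void step(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int left  = (i + n - 1) % n;
        int right = (i + 1) % n;
        out[i] = 0.5f * (in[left] + in[right]);
    }
}

// Hypothetical host helper: runs `iters` iterations, one kernel launch each.
// The boundary between two launches acts as the grid-wide synchronization.
std::vector<float> run_iterations(std::vector<float> host, int iters)
{
    int n = static_cast<int>(host.size());   // assumed > 0
    size_t bytes = n * sizeof(float);

    float* a = nullptr;
    float* b = nullptr;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMemcpy(a, host.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;

    for (int it = 0; it < iters; ++it) {
        step<<<blocks, threads>>>(a, b, n);  // launches in one stream run in order
        std::swap(a, b);                     // latest result is now in `a`
    }

    // Blocking copy: waits for all previously launched kernels to finish.
    cudaMemcpy(host.data(), a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    return host;
}
```

The two buffers are swapped on the host so that no block ever reads a value another block is overwriting in the same iteration. The launches themselves are asynchronous, so the host does not have to wait between iterations.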

Isn't there any overhead in doing it this way?