Block scheduling and L1 cache update ... about block synchronization

I know that thread synchronization between different blocks is not recommended in CUDA.

Some people have tried to do it with some success, as they say. Using a global variable with atomic operations seems to have worked for some.

I have two questions about that:

1- As I understand it, a block has to finish executing before giving up its SM to another block that is waiting to execute. Blocks are not swapped out of an SM and then swapped back in after some time.
AND
As I also understand it, block execution order is pretty much random. I mean that the first block (blockIdx.x=0) can be executed last, and the last block can be executed first.

So synchronization will invariably lead to a deadlock if the number of SMs is less than the number of blocks… whatever synchronization method I try to use.

Am I right about all that, or not?

2- If the number of blocks is less than or equal to the number of SMs, that means no block will wait for another block to get an SM, and the deadlock can be avoided.

Suppose I am using a global variable (in global memory) for synchronization. When that global variable is read by a thread of block A, it will be cached in the L1 of block A's SM. Now suppose the variable is modified by threads of block B, executing at the same time on another SM.
Will the L1 cache of block A's SM produce a cache miss the next time the variable is read? That cache miss would then fetch the updated value, and the threads of block A would see what the threads of block B have written to the global variable.

Thanks

As far as I can see your statements are correct: blocks, once started, are not swapped out of an SM on current devices (although in the PTX documentation Nvidia reserves the right to do this). So block synchronization with more blocks than can run concurrently will indeed lead to deadlocks.
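
To make the deadlock concrete, here is a minimal sketch of a naive spin-wait barrier (the kernel and variable names are mine, not from this thread). If the grid contains more blocks than can be resident at once, the resident blocks spin forever on a count that the unscheduled blocks can never increment, because resident blocks never yield their SM:

__device__ unsigned int arrived = 0;  // must be reset before each launch

__global__ void naiveBarrier()
{
    // Each block checks in once.
    if (threadIdx.x == 0)
        atomicAdd(&arrived, 1);
    __syncthreads();

    // Thread 0 spins until all gridDim.x blocks have checked in.
    // atomicAdd(&arrived, 0) serves as an atomic read that bypasses L1.
    // This deadlocks whenever gridDim.x exceeds the number of blocks
    // that can run concurrently on the device.
    if (threadIdx.x == 0)
        while (atomicAdd(&arrived, 0) < gridDim.x)
            ;
    __syncthreads();
}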

L1 caches are not coherent. __threadfence() can be used to force all dirty L1 cachelines of an SM to be written back.

According to my understanding, however, atomic functions don't operate in the L1 cache but directly in the memory controller, since the very purpose of atomic operations is to produce consistent results when different threads (potentially on different SMs) access the same location in global memory. So __threadfence() should not be necessary in that case.
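
As a hedged illustration of both points, here is a sketch of one block signalling another (names and values are illustrative, and it assumes both blocks are resident at the same time, as per question 2). The data write is made visible with __threadfence() before the flag is raised, and the flag itself is accessed through atomics so it never sits in a stale L1 line:

__device__ int payload;
__device__ int ready = 0;
__device__ int result;

__global__ void signalKernel()
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        payload = 42;           // ordinary global store (may linger in L1)
        __threadfence();        // make the store visible device-wide
        atomicExch(&ready, 1);  // raise the flag; atomics bypass L1
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (atomicAdd(&ready, 0) == 0)  // atomic read, never stale
            ;
        // Safe here because this SM has not cached 'payload' before,
        // so the read misses in L1 and fetches the fresh value.
        result = payload;
    }
}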

OK, thanks tera for your help…
Still not working like that in my code… maybe it's something else…

Oh, I just wanted to point out the cases where inter-block communication does not work. I did not mean to imply it works in all other cases - there are quite a few traps and pitfalls when you decide to venture into undocumented lands.

Inter-block communication is dangerous if you rely on the variable "blockIdx", because the execution order is random.

However, inter-block communication is possible if you can meet one condition:

every new block depends only on blocks which are already running on some SM or are done.

Then you can use an atomic operation to guarantee execution order.

i.e. a block with id blockIdx does not necessarily process the data block with the same id blockIdx.

You can use a global counter to obtain an id "data_bid" which corresponds to a data block:

int data_bid = atomicAdd( &counter, 1 );

For example, I use this technique to do in-place data compression.
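
For reference, a minimal sketch of this pattern (the kernel and parameter names are my own, not from the original post): each block draws its data-block id from the global counter, so ids are handed out in the order blocks actually start running, independent of blockIdx:

__device__ unsigned int counter = 0;  // reset before each launch, e.g. via cudaMemcpyToSymbol

__global__ void processInIssueOrder(float *data, int numDataBlocks)
{
    __shared__ unsigned int data_bid;

    // One thread per block draws the next free data-block id.
    if (threadIdx.x == 0)
        data_bid = atomicAdd(&counter, 1);
    __syncthreads();

    if (data_bid < numDataBlocks) {
        // Work on data block 'data_bid' rather than 'blockIdx.x':
        // data block N is only handed out after ids 0..N-1 have
        // already gone to blocks that are running or done, which is
        // exactly the dependency condition stated above.
        int idx = data_bid * blockDim.x + threadIdx.x;
        data[idx] *= 2.0f;  // placeholder work
    }
}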

@tera

__threadfence() called by threads on one SM cannot flush the L1 cache of another SM.

Good catch, I didn't see that my sentence could be misunderstood to imply this. I'll edit the original post accordingly.