I know that thread synchronization between different blocks is not recommended in CUDA.
Some people have tried it with some success, as they report; using a global variable with atomic operations seems to work for some of them.
I have two questions about that:
1- As I understand it, a block has to finish executing before it leaves the SM and makes room for another block that is waiting to execute. Blocks are not swapped out of an SM and swapped back in later.
As I also understand it, the block execution order is essentially arbitrary: the first block (blockIdx.x = 0) can execute last, and the last block can execute first.
So synchronization will invariably lead to a deadlock if there are more blocks than the SMs can hold concurrently... whatever synchronization method I try to use.
Am I right about all that, or not?
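To make question 1 concrete, here is a sketch of the kind of global-variable barrier I have in mind (the names and structure are mine, just for illustration). If any block has not yet been scheduled onto an SM, the resident blocks spin forever and never free an SM, which is exactly the deadlock I am asking about:

```cuda
// Hypothetical global barrier: each block increments a counter on arrival,
// then spins until every block has arrived. Deadlocks if some blocks are
// still waiting for an SM, because the resident blocks never finish.
__device__ unsigned int g_arrived = 0;

__global__ void kernelWithGlobalBarrier(unsigned int numBlocks)
{
    // ... per-block work before the barrier ...

    __syncthreads();                       // sync within the block first
    if (threadIdx.x == 0) {
        atomicAdd(&g_arrived, 1u);         // announce this block's arrival
        // atomicAdd(&x, 0) used as an atomic read of the counter
        while (atomicAdd(&g_arrived, 0u) < numBlocks) { /* spin */ }
    }
    __syncthreads();                       // release the rest of the block

    // ... per-block work after the barrier ...
}
```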
2- If the number of blocks is less than or equal to the number of SMs, then no block has to wait for another block to free an SM, and the deadlock can be avoided.
Now suppose I use a global variable (in global memory) for synchronization. When a thread of block A reads that variable, it gets cached in the L1 of block A's SM. Suppose the variable is then modified by a thread of block B, executing at the same time on another SM.
Will the L1 cache of block A's SM produce a cache miss the next time the variable is read? That cache miss would force a fetch of the up-to-date value, and the threads of block A would then see what the threads of block B wrote to the global variable.
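Here is a minimal sketch of the situation I am describing in question 2 (names are mine). My understanding is that L1 caches are not coherent across SMs, so a plain load by block A could keep returning the stale cached value; declaring the variable volatile (or accessing it with atomics) makes the accesses go to L2, which is shared by all SMs:

```cuda
// Block B sets a flag; block A polls it from another SM.
// volatile makes each read/write bypass the (non-coherent) L1,
// so block A eventually observes block B's write via L2.
__device__ volatile int g_flag = 0;

__global__ void producerConsumer(void)
{
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        g_flag = 1;              // block B: volatile write, not kept in L1
        __threadfence();         // make the write visible device-wide
    }
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        while (g_flag == 0) { }  // block A: volatile read, re-fetched each time
    }
}
```

Is that the right mental model, or does the hardware actually invalidate block A's L1 line and produce the cache miss I described?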