Global memory consistency: same address accessed by multiple blocks


I have a pretty general question about global memory access and blocks in CUDA:

I have a kernel with 512 threads per block and, let's say, 50 blocks.

Each block shall write a value to a certain address in global memory if and only if the value read at that address is greater than the one to be written. Here is a reduced code snippet to make that clearer:

float global_value = global[index];

if (global_value > shared[threadIdx.x]) {
  global[index] = shared[threadIdx.x];
}

So every block writes its value only if the value currently at the global address is greater. So far so good. BUT … it does not work :(

It seems as if some blocks overwrite the global memory results of other blocks with greater values, which they should not do.

How can that happen? I thought blocks were run sequentially and in order, so this should not happen. Is there some global memory issue I have overlooked?

Thanks in advance


Block execution order is undefined in the CUDA execution model. Blocks can and do run in whatever order the hardware scheduler chooses, and it is almost never sequential. Any code that relies on sequential block or thread execution ordering will fail. Your code should be safe as long as index is unique to every thread in the grid. If it isn't, then you have a classic memory race condition: the read and the write are two separate operations, so another block can write to the same address in between, and all sorts of read-after-write incoherencies will follow.
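If several blocks really do need to update the same address, the compare-and-write has to happen as one atomic step. CUDA has no built-in atomicMin for float, but one can be built from atomicCAS on the value's bit pattern; a sketch (the helper name atomicMinFloat is my own, not a CUDA API):

```cpp
// Hypothetical helper: an atomic float minimum built on atomicCAS.
__device__ float atomicMinFloat(float* addr, float value)
{
    int* addr_as_int = (int*)addr;
    int old = *addr_as_int;  // non-atomic seed read; CAS validates it

    // Retry until our candidate is no longer smaller, or our CAS wins.
    while (value < __int_as_float(old)) {
        int assumed = old;
        old = atomicCAS(addr_as_int, assumed, __float_as_int(value));
        if (old == assumed) break;  // our write went through
    }
    return __int_as_float(old);
}
```

With that helper, the racy read-compare-write in the question collapses to a single call, e.g. `atomicMinFloat(&global[index], shared[threadIdx.x]);`, which is correct regardless of block scheduling order.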