Hi,
I have a pretty general question about global memory access and blocks in CUDA:
I have a kernel with 512 threads per block and, let's say, 50 blocks.
Each block shall write a value to a certain address in global memory if and only if the value read at that address is greater than the one to be written. Here is a reduced code snippet to make that clearer:
float global_value = global[index];
if (global_value > shared[threadIdx.x]) {
    global[index] = shared[threadIdx.x];
}
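For anyone who wants to reproduce this, a minimal self-contained sketch built around the snippet follows. The kernel name keepSmaller, the helpers runKeepSmaller and expectedMin, the test data, and the choice index = threadIdx.x are all assumptions made up for illustration; they are not taken from the original code. Error checking is omitted for brevity.

```cuda
#include <algorithm>
#include <cfloat>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 512
#define NUM_BLOCKS 50

// Hypothetical kernel: each block stages its candidates in shared memory,
// then every thread applies the check-then-write from the snippet.
__global__ void keepSmaller(const float *input, float *global)
{
    __shared__ float shared[THREADS_PER_BLOCK];

    // Assumption: block b owns input[b * 512 .. b * 512 + 511].
    shared[threadIdx.x] = input[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();

    // Assumption: thread t of every block targets the same slot t,
    // so all blocks compete for each global address.
    int index = threadIdx.x;

    // The read and the conditional write are two separate memory operations.
    float global_value = global[index];
    if (global_value > shared[threadIdx.x]) {
        global[index] = shared[threadIdx.x];
    }
}

// Hypothetical host helper: runs the kernel once and returns the 512 result slots.
std::vector<float> runKeepSmaller(const std::vector<float> &hostInput)
{
    std::vector<float> hostGlobal(THREADS_PER_BLOCK, FLT_MAX);
    float *devInput = nullptr;
    float *devGlobal = nullptr;
    cudaMalloc(&devInput, hostInput.size() * sizeof(float));
    cudaMalloc(&devGlobal, hostGlobal.size() * sizeof(float));
    cudaMemcpy(devInput, hostInput.data(), hostInput.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(devGlobal, hostGlobal.data(), hostGlobal.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    keepSmaller<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(devInput, devGlobal);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hostGlobal.data(), devGlobal, hostGlobal.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(devInput);
    cudaFree(devGlobal);
    return hostGlobal;
}

// Smallest candidate for one slot across all blocks, computed on the host.
float expectedMin(const std::vector<float> &hostInput, int slot)
{
    float best = FLT_MAX;
    for (int b = 0; b < NUM_BLOCKS; ++b)
        best = std::min(best, hostInput[b * THREADS_PER_BLOCK + slot]);
    return best;
}

int main()
{
    std::vector<float> hostInput(THREADS_PER_BLOCK * NUM_BLOCKS);
    for (size_t i = 0; i < hostInput.size(); ++i)
        hostInput[i] = static_cast<float>((i * 7919) % 1000);  // arbitrary test data

    std::vector<float> result = runKeepSmaller(hostInput);

    int mismatches = 0;
    for (int t = 0; t < THREADS_PER_BLOCK; ++t)
        if (result[t] != expectedMin(hostInput, t))
            ++mismatches;

    printf("%d of %d slots differ from the expected minimum\n",
           mismatches, THREADS_PER_BLOCK);
    return 0;
}
```

The number of mismatching slots may differ from run to run and from GPU to GPU, so no expected output is given here.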
So every block writes out its value only if the value currently stored at the global address is greater. So far so good. BUT … it does not work :(
It seems as if some blocks are overwriting the results that other blocks wrote to global memory with greater values, which they should not do.
How can that happen? I thought blocks are run sequentially and in order, so that should not happen. Is there any global memory issue that I overlooked?
Thanks in advance
Julian