I’m curious whether some of the problems I’m seeing in a branch of my code might stem from the fact that I am trying to “swap” memory between arrays in __shared__ and arrays in __global__ memory. The idea is to have the thread block loop over each index of the respective arrays and do the typical swap operation:
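Roughly, in pseudo-code (a minimal sketch of the kind of loop I mean; the kernel name, array names, and the fixed 1024-element shared buffer are placeholders, not my actual code):

```cuda
__global__ void swap_shared_global(float *gbl_array, int n)
{
    __shared__ float sh[1024];             // assumes n <= 1024

    // ... code that fills sh[] and computes on it ...

    // Assumes n is a multiple of blockDim.x, so every thread executes the
    // same number of iterations and reaches both barriers together.
    for (int pos = threadIdx.x; pos < n; pos += blockDim.x) {
        float tmp      = sh[pos];          // read shared
        sh[pos]        = gbl_array[pos];   // read global, write shared
        __syncthreads();                   // barrier between the two halves of the swap
        gbl_array[pos] = tmp;              // write global
        __syncthreads();                   // second barrier, before the next pass
    }
}
```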
Update: I have implemented this solution for my main code, and it seems to fix the problem. I’m not sure if things are completely fine, but this definitely cleaned up some otherwise inexplicable pollution in my numbers. I did not end up needing the second __syncthreads() in the pseudo-code above, as the threads will move on to new indices of the arrays in subsequent iterations of the loop. In my actual code, I do have a __syncthreads() further down to catch the tail end of the writes to ensure that the swap is complete once it’s time to swap back.
Thank you, CUDA engineers! __syncthreads() and __syncwarp() memory barriers are truly impressive.
If each block operates on a separate gbl_array, you don’t need any synchronization because different threads access distinct array positions.
If multiple blocks share the same gbl_array, you have a race condition between blocks, which cannot be fixed by __syncthreads() (it only synchronizes the threads within a single block).
What if I have different blocks operating on the same global array, but on different sectors of it? Say block 0 operates on elements 0 through 1023, block 1 operates on elements 1024 through 2047, and so on. I think that’s OK. In effect, the gbl_array pointer is unique to each block, although if a block overruns its bounds (which I am careful to prevent) then yes, __syncthreads() would not help me with that, I agree.
But I did seem to fix a lot of problems when I began adding synchronization between reads from some location in global memory and writes to it, even though any given address is only operated upon by a single thread. Are you sure that no synchronization should be needed?
Yes, when all blocks work on their own non-overlapping sections, it’s fine. (It’s not obvious from the code snippet, since all blocks would use the same pos when only threadIdx.x is used.)
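For illustration, the distinction looks like this (names assumed, not from the original code):

```cuda
__global__ void per_block_sectors(float *gbl_array)
{
    // Indexing with threadIdx.x alone: every block generates the same set
    // of indices, so blocks sharing one gbl_array would collide.
    // int pos = threadIdx.x;

    // Grid-wide indexing: block 0 covers [0, blockDim.x), block 1 covers
    // [blockDim.x, 2 * blockDim.x), and so on. Each element is owned by
    // exactly one thread in the whole grid, so no synchronization is needed.
    int pos = blockIdx.x * blockDim.x + threadIdx.x;

    gbl_array[pos] *= 2.0f;
}
```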
I am sure that if an address is only accessed by a single thread, there cannot be a data race for that address. No synchronization is required.
Yes, sorry I didn’t make the distinct sectors clear in my example. If this is true, then I need to review what I’ve done, because I may have merely created a Heisenbug. It certainly went from “very bad, always bad” to “pretty quiet” as soon as I did what I describe above. And the new synchronization isn’t really much more costly, so far as I can tell, than the old one, so I don’t think I’ve changed the code in a way that would drastically tamp down on some other collision that’s happening. I will continue to investigate…
Have you run this code with compute-sanitizer? It can find many instances of race conditions, though not all of them. It also seems possible that you have an out-of-bounds access somewhere, which may not be obvious if it is an off-by-one error that does not trigger a memory access violation. Adding __syncthreads() may merely be masking the root cause of the observed data corruption, as you already suspected yourself.
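For reference, typical invocations look like this (./my_app is a placeholder for your executable):

```bash
# memcheck (the default tool) flags out-of-bounds and misaligned accesses
compute-sanitizer ./my_app

# racecheck reports shared-memory data hazards between threads of a block
compute-sanitizer --tool racecheck ./my_app
```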
@striker159 @njuffa Thanks for insisting. Indeed, I just removed the excess synchronization, and the code continues to produce identical results after a 21-minute run. All that really happened is that the code finishes about 5% faster.