Removing RAW race in global memory using __threadfence()


I am performing read-write accesses to an array in global memory from threads in multiple thread-blocks. Accesses to array elements are protected by locks. Before releasing a lock, I call __threadfence() to flush the thread's global memory writes to the L2 cache. I have also disabled L1 caching by using the '-Xptxas -dlcm=cg' flag during compilation.

GPU used is GTX 480.

I have stored the array addresses as 64-bit unsigned values and cast them when accessing the corresponding global memory locations. In the following code, __threadfence() does not always seem to push the writes out before the lock is released: I observe that updates to global memory are sometimes not seen by threads from other thread-blocks, and the behavior varies from run to run.

unsigned long long addr = get_addr(tid);
unsigned long long lock = get_lock(tid);

bool done = false;
while (!done) {
   // Try to acquire the lock: 0 = free, 1 = held.
   if (atomicCAS((unsigned *)lock, 0, 1) == 0) {
      unsigned data = *(unsigned *)addr;
      unsigned new_data = process(data);
      *(unsigned *)addr = new_data;

      done = true;

      // Push the write to addr out before releasing the lock.
      __threadfence();
      *(unsigned *)lock = 0;
   }
}

I have tried declaring the addr and lock variables as 'volatile', but it didn't solve the problem. And since L1 caching is disabled anyway, no access should be returning stale data from the L1 cache. So what could be wrong with this implementation?


Shouldn’t the atomic functions just work? Is there something extra you achieve with this code?

Atomic functions guarantee protected accesses, but they don't guarantee the order in which modifications made by one thread become visible to another. E.g., after thread A releases the lock (the *(unsigned *)lock = 0; store), thread B can see the lock as released before new_data has been written to memory. __threadfence() ensures that memory accesses made prior to the fence are complete before the lock is released. Fencing is required because the GPU offers relaxed memory consistency without any coherence support.
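To make that concrete, here is a minimal sketch of the fenced critical section being described, assuming a global lock word that is 0 when free and 1 when held. One deliberate change from the code above: the release uses atomicExch() instead of a plain store, which guarantees the unlock itself goes through the atomic path rather than an ordinary (potentially reordered or cached) write. This is a sketch, not the original poster's exact code.

```cuda
// Sketch: fenced critical section around a global data word.
// 'lock' and 'data' are assumed to point into global memory.
__device__ void locked_update(unsigned *lock, unsigned *data)
{
    // Spin until this thread acquires the lock (0 -> 1).
    while (atomicCAS(lock, 0u, 1u) != 0u)
        ;

    *data = *data + 1u;     // stand-in for the real read-modify-write

    __threadfence();        // make the data write visible device-wide first
    atomicExch(lock, 0u);   // then release the lock atomically
}
```

Note that only one thread per warp should attempt this acquire loop at a time on pre-Volta hardware, or the warp can livelock.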

The problem I am running into is exactly this issue. Another thread is seeing the lock released before the earlier write is complete, despite the __threadfence() call. And I have made sure that both threads are accessing the same lock and the same global memory location.


I am a little confused here. Let's assume we have a counter ccc and we want to register some event. Then each thread will execute an atomic increment on it.
Doing this ensures that the next thread which tries to increase the counter gets the correct value. I used this for histograms and I did not see any issue.
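The pattern referred to here is presumably the standard atomic event counter, something like the following sketch (ccc and the kernel name are illustrative, not from the original post):

```cuda
__device__ unsigned int ccc;   // global event counter, assumed zero-initialized

__global__ void count_events(void)
{
    // atomicAdd performs the read-modify-write as one indivisible
    // operation, so every thread observes a distinct, correct value.
    unsigned int old = atomicAdd(&ccc, 1u);
    (void)old;  // old is the counter value before this thread's increment
}
```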

Yes, atomicAdd works, but it only updates a single memory location. If you want to perform a more complex operation atomically (read, analyze, write, etc.), you need to implement a critical section as I described in my earlier post. For my application, the arithmetic atomic operations alone (atomicAdd, atomicDec, etc.) are not sufficient.

Can you post a complete example that reproduces the problem? It is hard to tell what some parts of the code are doing.

Did you try the prototype from the programming guide?
This one worked for me for addition of doubles.

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // CAS on the bit pattern of the double; retry if another
        // thread changed the value in the meantime.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
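For what it's worth, a hypothetical kernel using this overload might look like the following (sum_kernel and its parameters are made up for illustration):

```cuda
__global__ void sum_kernel(const double *in, double *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, in[i]);   // resolves to the CAS-based double overload above
}
```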