Removing RAW race in global memory using __threadfence()


I am performing read-write accesses to an array in global memory from threads in multiple thread-blocks. Accesses to array elements are protected by locks. Before releasing a lock, I call __threadfence() to flush the thread's global memory writes to the L2 cache. I have also disabled L1 caching by using the '-Xptxas -dlcm=cg' flag during compilation.

GPU used is GTX 480.

I have stored the array addresses as 64-bit unsigned values and cast them when accessing the corresponding global memory locations. In the following code, __threadfence() does not always seem to push the writes out before the lock is released: I observe that updates to global memory are sometimes not seen by threads from other thread-blocks, and the behavior varies from run to run.

unsigned long long addr = get_addr(tid);
unsigned long long lock = get_lock(tid);

bool done = false;
while (!done) {
   // Try to acquire the lock: 0 = free, 1 = held.
   if (atomicCAS((unsigned *)lock, 0, 1) == 0) {
      unsigned data = *(unsigned *)addr;
      unsigned new_data = process(data);
      *(unsigned *)addr = new_data;

      done = true;

      // Push the write to addr out before releasing the lock.
      __threadfence();
      *(unsigned *)lock = 0;
   }
}

I have tried declaring the addr and lock variables as 'volatile', but it didn't solve the problem. And since L1 caching is disabled anyway, no access should be returning stale data from the L1 cache. So what could be wrong with this implementation?


Shouldn’t the atomic functions just work? Is there something extra you achieve with this code?

Atomic functions guarantee protected accesses, but they don't guarantee the order in which modifications made by one thread become visible to another. E.g., after thread A releases the lock (the *(unsigned *)lock = 0; store), thread B can see the lock as released before new_data has been written to memory. __threadfence() ensures that memory accesses made prior to the fence are complete before the lock is released. Fencing is required because the GPU offers relaxed memory consistency without any coherence support.
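To make that concrete, here is a minimal sketch of the fenced critical section being described, assuming a global lock word that is 0 when free and 1 when held. One deliberate change from the code above: the release uses atomicExch() instead of a plain store, which guarantees the unlock itself goes through the atomic path rather than an ordinary (potentially reordered or cached) write. This is a sketch, not the original poster's exact code.

```cuda
// Sketch: fenced critical section around a global data word.
// 'lock' and 'data' are assumed to point into global memory.
__device__ void locked_update(unsigned *lock, unsigned *data)
{
    // Spin until this thread acquires the lock (0 -> 1).
    while (atomicCAS(lock, 0u, 1u) != 0u)
        ;

    *data = *data + 1u;     // stand-in for the real read-modify-write

    __threadfence();        // make the data write visible device-wide first
    atomicExch(lock, 0u);   // then release the lock atomically
}
```

Note that only one thread per warp should attempt this acquire loop at a time on pre-Volta hardware, or the warp can livelock.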

The problem I am running into is exactly this issue. Another thread is seeing the lock released before the earlier write is complete, despite the __threadfence() call. And I have made sure that both threads are accessing the same lock and the same global memory location.


I am a little confused here. Let's assume we have a counter ccc and we want to register some event. Then each thread will execute an atomic increment on it.
Doing this ensures that the next thread which tries to increase the counter gets the correct value. I used this for histograms and I did not see any issue.
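The pattern referred to here is presumably the standard atomic event counter, something like the following sketch (ccc and the kernel name are illustrative, not from the original post):

```cuda
__device__ unsigned int ccc;   // global event counter, assumed zero-initialized

__global__ void count_events(void)
{
    // atomicAdd performs the read-modify-write as one indivisible
    // operation, so every thread observes a distinct, correct value.
    unsigned int old = atomicAdd(&ccc, 1u);
    (void)old;  // old is the counter value before this thread's increment
}
```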

Yes, atomicAdd works, but it only updates a single memory location. If you want to perform a more complex operation atomically (read, analyze, write, etc.), you need to implement a critical section as I described in my earlier post. For my application, the arithmetic atomic operations alone (atomicAdd, atomicDec, etc.) are not sufficient.

Can you post a complete example that reproduces the problem? It is hard to tell what some parts of the code are doing.

Did you try the prototype from the programming guide?
This one worked for me for addition of doubles.

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // CAS on the bit pattern of the double; retry if another
        // thread changed the value in the meantime.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
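For what it's worth, a hypothetical kernel using this overload might look like the following (sum_kernel and its parameters are made up for illustration):

```cuda
__global__ void sum_kernel(const double *in, double *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, in[i]);   // resolves to the CAS-based double overload above
}
```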