Correct use of __threadfence() to remove the RAW race - cannot remove race condition

GPU used - Tesla 2070

I am managing a worklist -

  • Work-producing threads write back to the global list and call __threadfence() so that the writes are visible to all threads, and then one thread of the block updates the tail using an atomicAdd.
  • Work-consuming threads read from the global list (the location from where they read is defined by the current head and tail of the list - I have double-checked this, and the head and tail values and the indices which the threads use to read out of the global list are correct).
  • Note: I write out to the list first and then update the tail.
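For clarity, here is the producer-side sequence I described, reduced to a minimal sketch. All names here (worklist, d_tail, push_work, MAX_WORK) are placeholders, not my actual code, and the lock that protects the index calculation in my real code is omitted:

```cuda
// Minimal sketch of the producer-side protocol described above.
// Placeholder names; the real code computes each thread's offset
// under a lock before this sequence runs.
#define MAX_WORK 4096

__device__ double worklist[MAX_WORK];
__device__ unsigned int d_tail;

__device__ void push_work(double item, int myOffset, int itemsInBlock)
{
    // 1. Each producing thread writes its item past the current tail
    //    (myOffset was computed under the lock by thread 0).
    worklist[d_tail + myOffset] = item;

    // 2. Fence: commit the data write before the tail is advanced.
    __threadfence();

    // 3. Barrier: every thread of the block has written AND fenced.
    __syncthreads();

    // 4. One thread publishes the new items by moving the tail.
    if (threadIdx.x == 0)
        atomicAdd(&d_tail, (unsigned int)itemsInBlock);
}
```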

The problem I am facing is that despite using __threadfence(), the value that I am writing to the global list is sometimes not visible to other threads, so they sometimes read junk.

To prove this, I printed out the whole worklist at the end of processing AND also printed out the work (a pair of double-type numbers) that each consumer thread reads.

The worklist (printed at the end, when all stores are finished) has no junk values, while the 'work' read by the consumer threads has junk - which means that WHEN a consumer thread read from the list, the store had not completed at that point.

The programming guide states the following:

void __threadfence();

waits until all global and shared memory accesses made by the calling thread prior to __threadfence() are visible to:
  • all threads in the thread block for shared memory accesses,
  • all threads in the device for global memory accesses.

According to this, I should not have had this issue after using __threadfence().

Please help me out here.

__threadfence() is not a barrier like __syncthreads(), so you're having a race condition.

Basically, __threadfence() makes the writing thread wait until its memory writes are fully committed, but a barrier makes all threads wait and synchronize, which is probably what you want.
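A sketch of the distinction (the data/flag names here are hypothetical, just to illustrate):

```cuda
__device__ int data;
__device__ volatile int flag;

__global__ void writer()
{
    data = 42;        // payload store
    __threadfence();  // orders THIS thread's two stores as seen by
                      // other threads: data becomes visible no later
                      // than flag. It does NOT make any thread wait.
    flag = 1;         // publish

    __syncthreads();  // by contrast, a barrier: every thread of the
                      // block stalls here until all of them arrive.
}
```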

If you post a little code (or pseudocode) it always helps diagnose bugs.

It's a huge code base, so I will try to summarize my code flow first.

Rough workflow -

- Each thread is in a continuous while loop - persistent threads.

- There is a global worklist. On each thread-block iteration, one thread out of the block (blocksize = 32) goes out, gets a global lock, calculates the indices (from which each thread in the block will read) from the head, the tail, and the 'number of jobs to get', and puts these indices in a shared array which all threads in the block use to read work from the global list. As far as I can tell I have been very careful with the locking and syncs, and there should not be a problem here.

  • After doing its JOB, each thread decides whether or not to add more work back to the worklist. The threads update a shared flag array.

  • Again, one single thread gets a lock (to manipulate the head and tail of the global list), calculates the number of items to be added back, and calculates the indices which will later be used by the "writing threads" as the positions where they write.

  • Then each thread (if it needs to) writes new work back to the worklist.

  • I use a __threadfence() and then a __syncthreads() right after this.

  • After this, one single thread updates the tail by the number of work items added to the list. Only after the tail is updated can any thread in any other block read the new data.



		if(threadIdx.x == 0) {
			// Gets the position where each thread will write back (shared_ldst_addr) - also acquires the lock.
			// Returns the number of new items that are going to be added to the worklist.
			addTotail = assign_WB_positions(lock, token, shared_ldst_addr, writeback_type, d_tail, fcount, icount);
		}

		__syncthreads(); // make sure every thread sees shared_ldst_addr written by thread 0

		if(writeback_type[threadIdx.x] == 3) {
			worklist1[shared_ldst_addr[threadIdx.x]] = temp_params2[0];
			worklist2[shared_ldst_addr[threadIdx.x]] = temp_params2[1];
		}

		__threadfence(); // I am assuming that this threadfence, called by the thread which is writing something
		                 // to the global array, should make its write visible to all others.

		__syncthreads(); // this should wait for all the threads in the block to write out their data and issue a threadfence.

		if(threadIdx.x == 0) // one single thread updates the TAIL
			updateTail(addTotail, lock, d_tail, token);







Thanks again

I appreciate your quick response.



I only started to see this bug as I optimize the code - making it faster. The optimizations I have done are not related to the worklist-management part of the program and surely have not caused this bug. Which means that this bug has probably been there all along and I am only starting to see it as the code runs faster - which implies a probable race and not a mistake in indexing etc.


Did you try to use this from the programming guide:

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,
                    float* result)
{
    // Each block sums a subset of the input array
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {

        // Thread 0 of each block stores the partial sum
        // to global memory
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure its result is visible to
        // all other threads
        __threadfence();

        // Thread 0 of each block signals that it is done
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 of each block determines if its block is
        // the last block to be done
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone
    __syncthreads();

    if (isLastBlockDone) {

        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {

            // Thread 0 of last block stores total sum
            // to global memory and resets count so that
            // next kernel call works properly
            result[0] = totalSum;
            count = 0;
        }
    }
}


This code is for a reduction-like sum. Each block computes a partial sum and writes the result to global memory, and the last block to finish performs the final summation.

I think I am using the __threadfence in a similar way as the above example.

So I made two implementations - according to my understanding both should have worked but both fail.

STYLE 1 - Doesn't work!

if(threadIdx.x == 0) {
    acquire_lock(lock);          // ONE thread of the block locks the global list
    get_writeback_location(---); // thread 0 updates a shared array "shared_ldst_addr" which holds the location
                                 // where each thread will write (if it needs to)
}

__syncthreads(); // make sure every thread sees shared_ldst_addr written by thread 0

// any thread which needs to write
if(writeback_type[threadIdx.x] == 3) {
    worklist1[shared_ldst_addr[threadIdx.x]] = temp_params2[0];
    worklist2[shared_ldst_addr[threadIdx.x]] = temp_params2[1];
}

__threadfence(); // I am assuming that this threadfence, called by the thread which is writing something
                 // to the global array, should make its write visible to all others.

__syncthreads(); // this should wait for all the threads in the block to write out their data and issue a threadfence.

if(threadIdx.x == 0) // one single thread updates the TAIL
    updateTail(addTotail, lock, d_tail, token); // update the tail now -- BUT NOTE we already had a __syncthreads(),
                                                // which means all the threadfence instructions should have executed
                                                // by now, so all writes must be visible to all threads according
                                                // to the threadfence description.



Style 2 - Doesn't work either.

Here basically only one thread writes out all the data that any thread in the block needs to write, then calls __threadfence(), then updates the tail and releases the lock. Again, according to the threadfence description, the new store of the tail (the update) should only become visible after everything else the thread wrote before it is visible. But apparently that does not happen.

Someone really needs to help me out here.


How does the consuming side of the code look - do you have a __syncthreads() in there as well, to ensure all data is (re-)read from global memory?

Yes, as far as I can tell I have been very careful about synchronization everywhere. I do have __syncthreads() on the consumer side.

I have cross-checked my code and nothing looks wrong. I am now paranoid about my understanding of __threadfence(), how to actually use it, and whether it even works.

Also, I do the head/tail manipulation and related computation (calculating the indices other threads use for their reads/writes) inside a lock.

Do you guys want me to post the whole code?


I cannot make any guarantees, but that’s probably the way forward.

Is your worklist declared as volatile?

That actually fixed my problem!! I should have been more careful! I was using volatile for the head and tail but somehow missed the obvious fact that the worklist's values themselves are being written to main memory by other threads, and that their copies in the cache are old/stale.

The fact that I use so much cache due to register spills made matters worse - because it almost always evicted the worklist values from the cache, so the requesting thread was forced to read them directly from main memory (I guess) almost every time, which meant the thread was getting the true value.

But on those rare occasions when there actually was a cache hit, the thread would get an incorrect value. So the low frequency of occurrence led me to believe that there was a race of some kind - which was not the case.

It was just a simple case of the caches not being coherent on GPUs and me, the programmer, not taking care of it properly. :)
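For anyone hitting the same thing, the change that fixed it boils down to the sketch below (placeholder names, not my real code): qualifying the worklist pointers as volatile so that consumer loads bypass the stale L1 line instead of hitting it.

```cuda
// Before: ordinary pointers - a consumer load can be served from a
// stale L1 cache line written before the producer's store landed.
//   double *worklist1, *worklist2;

// After: volatile-qualified accesses - the compiler emits ld.global.cv
// loads, which refetch from global memory, so the consumer sees the
// producer's committed stores. Placeholder names throughout.
__device__ double buf1[4096], buf2[4096];

__device__ void read_work(int idx, double out[2])
{
    volatile double *worklist1 = buf1;
    volatile double *worklist2 = buf2;

    out[0] = worklist1[idx]; // cache-bypassing load
    out[1] = worklist2[idx];
}
```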

Thanks a lot David for drawing my attention to it…

I had actually made a makeshift fix for my problem with a 'made-up' replay-read method - identifying when a thread read a junk value and smartly replaying the read at a later stage - which would have been useful if there were no __threadfence provided by Nvidia :)

I have a quick question regarding making things volatile.

The way I understand volatile is that making something volatile forces the compiler not to keep the value in a register but instead to issue a load for it.
That load in general goes to the memory hierarchy. For my scheme to work, volatile must force the read/load to happen from main memory instead of cache OR register - so a cache read must also be avoided.

Does making something volatile ensure this?

For volatile variables the compiler issues load instructions with a special .cv cache operator that treats the cached line as stale and fetches the data again from global memory. See the ld instruction in the PTX manual.

Thanks guys.
You have been very helpful.

I also wanted to draw your attention to another question that I have - refer to this thread -

Thanks again