Correct use of _threadfence() to remove the RAW race Cannot remove race condition

sidxavier · April 7, 2012, 5:28pm

GPU used - Tesla 2070

I am managing a worklist -

work-producing threads write back to global list and Call __threadfence() so that this is visible to all threads and then one thread of the block updates the tail using an atomicAdd.
work-consuming threads read from the global list (the location from where they read is defined by current head and tail of list - I have double checked this and the head and tail values and indices which the threads use to read out of the global list are correct).
** I write out to the list first and then update the tail.

The problem I am facing is - despite using threadfence - the value that I am writing on the global list is sometimes not visible to other threads and thus they sometimes read junk.

To prove this - I printed out the whole worklist at the end of processing AND also print out the work(pair of double type numbers) that each consumer thread reads.

The worklist (printed at the end - all stores are finished) has no Junk value while the ‘work’ read by consumer thread has junk - which means that WHEN consumer thread read from the list at that point store had not completed.

The programming guide states the following -
void threadfence();
waits until all global and shared memory accesses made by the calling thread prior to
threadfence() are visible to:
ï± All threads in the thread block for shared memory accesses,
ï± All threads in the device for global memory accesses.

According to this I should not have had this issue after using __threadfence().

Please help me out here.
Sid

SPWorley · April 7, 2012, 5:36pm

Threadfence is not a barrier like syncthreads(), so you’re having a race condition.

Basically threadfence() makes the writing thread warp wait for its memory write to be fully committed, but a barrier will make all threads wait and synchronize, which is probably what you want.

If you post a little code (or pseudocode) it always help diagnose bugs.

sidxavier · April 7, 2012, 5:55pm

Its a huge code so I will try to summarize my code flow first -

Rough workflow -

-Each thread is in a continuous while loop - persistent thread

-There is a global worklist - on each threadblock iteration - one thread out of the block ((blocksize = 32) goes out - gets a global lock, calculates index (from which each thread in the block would read) from head,tail and ‘number of jobs to get’ and puts these indices on a shard array which all threads in the block use to read work from the global list. According to me I have been very careful with the locking, syncs and there should not be a problem here.

After doing the some JOB - each threads decides whether to add more work back to worklist or not. They update some shared flag array.
Again one single thread gets a lock (to manipulate head and tail of global list), calculates the number of things to be added back, calculates indices which would later be read by “writing threads” and used as position where they would write.
Then Each thread (if it has needs to) writes new work back to worklist.
I use a threadfence and then a syncthread right after this
After this, one single thread updates the tail by the number of work added to the list. Only after the tail is updated can any thread in any other block reads the new data.

So my requirement is that BEFORE THE TAIL GETS UPDATED WITH AN ATOMICADD MY WRITES ONTO THE GLOBAL WORKLIST SHOULD HAVE ALREADY COMMITTED.

if(is_myblock_busy!=0){

		if(threadIdx.x == 0)

		{

			// gets position where each thread would be writing back (shared_ldst_addr) - also acquires lock.  

                        // Returns number of new things that are going to be added to worklist 

			addTotail =	assign_WB_positions(lock,token,shared_ldst_addr,writeback_type,d_tail,fcount,icount );

		}

	}

__syncthreads();

		 if(writeback_type[threadIdx.x] == 3)

		{

			worklist1[shared_ldst_addr[threadIdx.x]] = temp_params2[0];

			worklist2[shared_ldst_addr[threadIdx.x]] = temp_params2[1];

		

		 __threadfence(); // I am assuming here that this threadfence called by the thread which is writing something on 

                                 // global array should make its write visible to all others.

	        }

__syncthreads(); // this should wait for all the threads in the block to write out their data and issue a threadfence.

	 if(threadIdx.x == 0 ) // one single thread updates the TAIL

		 {  

			 if(addTotail!=0)

			 {

				updateTail(addTotail,lock,d_tail,token);

			 }

          }

Thanks again

I appreciate your quick response.

Sid

sidxavier · April 7, 2012, 6:00pm

Also,

I only started to see this bug as I am optimizing the code - making it faster. Although, the optimization I have done are not related with worklist management part of the program and surely have not caused this bug. Which means that probably this bug has been there and I am only starting to see it as the code is running faster? Which implies a probable race and not a mistake in indexing etc.

Sid

pasoleatis · April 8, 2012, 12:14pm

Did you try to use this from the programming guide:

In general, when a thread issues a series of writes to memory in a particular order,

other threads may see the effects of these memory writes in a different order.

__threadfence_block(), __threadfence(), and

__threadfence_system() can be used to enforce some ordering.

One use case is when threads consume some data produced by other threads as

illustrated by the following code sample of a kernel that computes the sum of an

array of N numbers in one call. Each block first sums a subset of the array and

stores the result in global memory. When all blocks are done, the last block done

reads each of these partial sums from global memory and sums them to obtain the

final result. In order to determine which block is finished last, each block atomically

increments a counter to signal that it is done with computing and storing its partial

sum (see Section B.11 about atomic functions). The last block is the one that

receives the counter value equal to gridDim.x-1. If no fence is placed between

storing the partial sum and incrementing the counter, the counter might increment

before the partial sum is stored and therefore, might reach gridDim.x-1 and let

the last block start reading partial sums before they have been actually updated in

memory.

__device__ unsigned int count = 0;

__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,

float* result)

{

// Each block sums a subset of the input array

float partialSum = calculatePartialSum(array, N);

if (threadIdx.x == 0) {

// Thread 0 of each block stores the partial sum

// to global memory

result[blockIdx.x] = partialSum;

// Thread 0 makes sure its result is visible to

// all other threads

__threadfence();

// Thread 0 of each block signals that it is done

unsigned int value = atomicInc(&count, gridDim.x);

// Thread 0 of each block determines if its block is

// the last block to be done

isLastBlockDone = (value == (gridDim.x - 1));

}

// Synchronize to make sure that each thread reads

// the correct value of isLastBlockDone

__syncthreads();

if (isLastBlockDone) {

// The last block sums the partial sums

// stored in result[0 .. gridDim.x-1]

float totalSum = calculateTotalSum(result);

if (threadIdx.x == 0) {

// Thread 0 of last block stores total sum

// to global memory and resets count so that

// next kernel call works properly

result[0] = totalSum;

count = 0;

}

This code is for reduction like sum. Each block makes the partial sum writes the results to the global memory and the last block to be executed makes the final summation.

sidxavier · April 9, 2012, 5:25am

I think I am using the __threadfence in a similar way as the above example.

sidxavier · April 9, 2012, 6:00pm

So I made two implementations - according to my understanding both should have worked but both fail.

STYLE 1 - Doesnt work!

if(threadIdx.x ==0)

{

acquire_lock(lock); // ONE thread of the block locks the global list 

get_writeback_location(---); // the thread-0 updates a shared array  "shared_ldst_addr" which has the location where each thread 

                             //would write (if it needs to)

}

// any thread which needs to write 

if(writeback_type[threadIdx.x] == 3)

{

worklist1[shared_ldst_addr[threadIdx.x]] = temp_params2[0];

worklist2[shared_ldst_addr[threadIdx.x]] = temp_params2[1];

__threadfence(); // I am assuming here that this threadfence called by the thread which is writing something on global array 

                 //should make its write visible to all others.

}

__syncthreads(); // this should wait for all the threads in the block to write out their data and issue a threadfence.

if(threadIdx.x == 0 ) // one single thread updates the TAIL

{ 

updateTail(addTotail,lock,d_tail,token); // update the tail now -- BUT NOT WE ALREADY HAD A syncthread that means all the 

                                         //threadfence instructions should be executed by now so All writes must be visible to 

                                         //threads according to threadfence description. 

release_lock(lock);

}

Style 2 - Doesnt work again.

// Here basically only one thread writes out all the data that any thread in the block needs to write and then calls threadfence 

//and then updates tail and releases the lock. Again according to threadfence - the new store of tail (update) should only be 

//visible after all the other stuff that the thread wrote before it is visible. But that doesnt happen apparently.

Some one really needs to help me out here.

Sid.

tera · April 10, 2012, 10:52am

How does the consuming side of the code look like - do you have a [font=“Courier New”]__syncthreads()[/font] in there as well to ensure all data is (re-)read from global memory?

sidxavier · April 11, 2012, 3:24am

Yes to my levels I have been very careful about synchronizations everywhere. I do have syncthreads on the consumer side.

I have crossed checked my code and nothing looks wrong. I am now paranoid about my understanding of __threadfence and how to actually use it and if it even works.

Also I do head/tail manipulation and related computation (calculating indices for other threads for their read/write) In a lock.

Do you guys want me to post the whole code?

Sid

tera · April 11, 2012, 12:13pm

I cannot make any guarantees, but that’s probably the way forward.

David_H · April 17, 2012, 10:39am

Is your worklist declared as volatile?

sidxavier · April 19, 2012, 7:18am

That actually fixed my problem!! Should have been careful! I was using volatile for head and tail but somehow missed the obvious fact that worklist’s values itself are being written by someone on main memory and that their values in the cache are ‘old/stale’.

The fact that I am using so much cache due to register spills made the matters worse - because - it almost always evicted the worklist value from the cache and thus the requesting thread was forced to read it directly from the main memory (i guess) almost always which meant the thread was getting the true value.

But on those rare occasions when there was actually a cache hit - the thread would get incorrect value. So the low frequency of occurrence led me to believe that there was a race of some kind - which was not the case.

It was just a simple case of cache not being coherent on gpus and Me the programmer not taking care of it properly. :)

Thanks a lot David for drawing my attention to it…

I had actually kinda made a make do fix for my problem with a ‘madeup’ replay read method - which was identifying if the thread read a junk value and smartly replaying the read at a later stage – this would have been useful if there was no __threadfence provided by Nvidia :)

sidxavier · April 20, 2012, 5:12am

I have a quick question regarding making things volatile.

The way I understand volatile is – making something volatile forces the compiler to not look for that value in register - but instead issue a Load for it.
This load in general goes to the memory hierarchy. For my thing to work - a volatile must force the read/load to happen from Main memory instead of cache OR register - so cache read must also be avoided.

Does making something volatile ensure this?

tera · April 20, 2012, 8:23am

Yes.
For volatile variables the compiler issues load instructions with a special .cv modifier that invalidates the cache entry and refetches data from global memory. See section 8.7.6.1 of the PTX manual.

sidxavier · April 23, 2012, 9:11am

Thanks guys.
You have been very helpful.

I also wanted to draw your attention to another question that I have - refer to this thread -

Thanks again
Sid.

Topic		Replies	Views
Threadfence or Atomic - propagating global writes CUDA Programming and Performance	4	1714	March 15, 2012
Removing RAW race in global memory using __threadfence() CUDA Programming and Performance	6	1035	July 26, 2013
Synchronization, threadfence, random memory access beginner questions CUDA Programming and Performance	7	2659	April 9, 2012
Doubt on __threadfence() require a detail description of this function. CUDA Programming and Performance	5	2940	January 25, 2010
Problems with __threadfence CUDA Programming and Performance	2	3122	November 11, 2009
Global memory coherence in compute capability 2.0 Does __threadfence() really do what's on the t CUDA Programming and Performance	1	3707	April 11, 2012
__threadfence() problem CUDA Programming and Performance	2	9468	January 11, 2011
Do I need threadfence? CUDA Programming and Performance	4	1742	April 11, 2012
Question related __threadfence CUDA Programming and Performance	13	5103	January 12, 2016
__syncthreads and __threadfence together in a loop CUDA Programming and Performance	5	3603	October 15, 2010

Correct use of _threadfence() to remove the RAW race Cannot remove race condition

Related topics