Threadfence or Atomic - propagating global writes


I have designed a list (array) adder/deleter system. It's a list of work from which threads either read work (delete) or add more work (add).
I have blocks of 32 threads (made equal to the warp size for now, for simplicity).
When a block of threads needs to read data from the list - i.e., pick work from the worklist - one of my threads goes out and acquires a lock (a spin-lock built on atomicCAS). Once inside the lock, it reads the head and tail of the list and checks whether the amount of work available (tail - head) is more or less than the number of threads requesting work (32 in our case, represented by the one thread that took the lock).

Based on the availability of elements in the worklist, I update the global head with a "simple" add: *head = *head + no_of_elements_being_read;
Then I release the lock. The thread that acquired the lock distributes the indices to be read from the global list to all threads via a shared array. Each thread then reads the index it should use FROM that read_from array (which is in shared memory) and reads its data from the global list accordingly.
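For concreteness, here is a minimal sketch of the scheme described above, with hypothetical names (d_lock, d_head, d_tail, d_worklist) standing in for the actual variables:

```cuda
__device__ int d_lock;             // 0 = free, 1 = held
__device__ int d_head, d_tail;     // worklist bounds
__device__ int d_worklist[1 << 20];

__global__ void consume(void)
{
    __shared__ int s_start, s_count;

    if (threadIdx.x == 0) {
        while (atomicCAS(&d_lock, 0, 1) != 0)   // spin until the lock is free
            ;
        int head  = d_head;
        int avail = d_tail - head;
        int take  = min(avail, (int)blockDim.x); // never take more than is available
        d_head = head + take;       // the "simple" add, done while holding the lock
        atomicExch(&d_lock, 0);     // release the lock
        s_start = head;             // broadcast via shared memory
        s_count = take;
    }
    __syncthreads();

    if (threadIdx.x < s_count) {
        int item = d_worklist[s_start + threadIdx.x];
        // ... process item ...
    }
}
```

Whether the plain store to d_head here is guaranteed to be visible to other blocks before the lock is released is exactly the question below.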

NOW, is it possible that the value of head I updated has not propagated before some other thread reads the head and does its calculation?

If it is possible: 1) Is there no coherence maintained on GPUs? (Maybe this has been said many times and I just haven't paid attention.)
2) Should I be using threadfence or atomics to propagate my writes at the right time? Which of the two resolves the correctness issue with better performance?


Put a __threadfence() between the writes to the list and the release of the lock. And make sure to declare your list as volatile so it gets re-read on the consuming side.
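In code, the suggested ordering for the lock-holding thread would look something like this (hypothetical names; the key point is that the fence sits between the head update and the release):

```cuda
__device__ int d_lock;
__device__ volatile int d_head;   // volatile so consumers re-read it

__device__ void update_and_release(int taken)
{
    d_head = d_head + taken;  // update the shared head...
    __threadfence();          // ...make the write visible device-wide...
    atomicExch(&d_lock, 0);   // ...and only then release the lock
}
```

Without the fence, another block could acquire the lock and read a stale head, since ordinary global writes are not guaranteed to be visible to other blocks in any particular order.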

Question - does a single call to threadfence from just one of the threads have the same effect?

I mean, basically, all my threads write while the lock is held by one thread of the block. Then that one thread releases the lock with an atomicExch - so if I put a __threadfence just before it (i.e., that part of the code is executed by just one thread), would that work?

Actually, I am still seeing some races, and I'm trying to figure out what's happening!

Thanks again


Also, thanks for pointing out that I must declare such variables (head, tail, and the list elements - not really reused, but still) as volatile.
I will post again after making these changes.


Since I am implementing a global worklist, I add to the worklist using the list tail and read (delete) from the worklist using the head of the list.

None of my threads re-reads any particular location of the list; however, I do re-read the tail and head values repeatedly. Hence I think there is no need to declare the list itself as volatile?

I must, however, declare global variables like tail and head as volatile?

In that case, on the CPU code side, how do I use cudaMemcpy or atomics on them? By type casting?
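A sketch of what that casting could look like, assuming a volatile device variable d_head (the names here are illustrative, not from the original code). Device-side atomics take a plain pointer, so the volatile qualifier is cast away at the call site; on the host, cudaMemcpyFromSymbol/cudaMemcpyToSymbol operate on the symbol itself:

```cuda
__device__ volatile int d_head;

__global__ void bump_head(void)
{
    // atomicAdd expects int*, so cast away the volatile qualifier
    atomicAdd((int *)&d_head, 1);
}

// Host side: copy the device symbol into a host variable
int read_head(void)
{
    int h = 0;
    cudaMemcpyFromSymbol(&h, d_head, sizeof(int));
    return h;
}
```

The cast only affects how that one access is compiled; other reads of d_head in device code still honor the volatile qualifier.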