I have designed a list (array) based work adder/deleter system. It's a worklist from which threads remove work (delete) or add more work (add).
I use blocks of 32 threads (made equal to the warp size for now, for simplicity).
When a block of threads needs to read data from the list, i.e. pick work from the worklist, one of my threads acquires a lock (a spin-lock built with atomicCAS). Once inside the lock, it reads the head and tail of the list and checks whether the number of work items available (tail - head) is more or less than the number of threads requesting work (32 in our case, represented by the single thread that took the lock).
Depending on how many elements are available in the worklist, I update the global head with a "simple" (non-atomic) add: *head = *head + no_of_elements_being_read;
Then I release the lock. The thread that acquired the lock now distributes the indices to be read from the global list to all threads via a shared-memory array. Each thread reads the index it should fetch from this read_from array (in shared memory) and reads its data from the global list accordingly.
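To make the above concrete, here is a minimal sketch of my sequence. All names (lock, head, tail, worklist, read_from, WORKLIST_CAP) are illustrative, and the head update is the plain non-atomic add I described:

```cuda
#define WORKLIST_CAP 1024            // illustrative capacity

__device__ int lock;                 // 0 = free, 1 = held
__device__ int head, tail;           // worklist bounds
__device__ int worklist[WORKLIST_CAP];

__global__ void grab_work()
{
    __shared__ int read_from[32];    // per-thread indices into worklist
    __shared__ int n_grabbed;

    if (threadIdx.x == 0) {
        while (atomicCAS(&lock, 0, 1) != 0)   // spin until we own the lock
            ;
        int avail = tail - head;              // work items currently queued
        n_grabbed = min(avail, 32);           // take at most one item per thread
        for (int i = 0; i < n_grabbed; ++i)
            read_from[i] = head + i;          // hand out indices to the warp
        head = head + n_grabbed;              // the "simple" non-atomic add
        atomicExch(&lock, 0);                 // release the lock
    }
    __syncthreads();                 // read_from / n_grabbed now visible in-block

    if (threadIdx.x < n_grabbed) {
        int item = worklist[read_from[threadIdx.x]];
        // ... process item ...
    }
}
```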
NOW: is it possible that the value of head I updated has not propagated yet when some other thread reads head and does its own calculation?
If it is possible: 1) Is there no coherence being maintained on GPUs? (Maybe this has been said many times and I just haven't paid attention.)
2) Should I be using __threadfence() or atomics to make my writes visible at the right time? Which of the two would resolve the correctness issue with better performance?
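For reference, the two alternatives I'm asking about would look roughly like this at the head-update step (lock, head, and no_of_elements_being_read as in my description; this is a sketch, and my understanding that global atomics bypass L1 and operate on L2 is part of what I'd like confirmed):

```cuda
__device__ int lock;   // 0 = free, 1 = held (illustrative)

// Option 1: keep the plain add, but fence before releasing the lock,
// intending the new head to be visible device-wide before the unlock is seen.
__device__ void update_head_fenced(int *head, int no_of_elements_being_read)
{
    *head = *head + no_of_elements_being_read;  // the existing "simple" add
    __threadfence();                            // push the write past this SM
    atomicExch(&lock, 0);                       // release the lock
}

// Option 2: make the head update itself atomic, relying on atomics
// operating directly on L2/global memory.
__device__ void update_head_atomic(int *head, int no_of_elements_being_read)
{
    atomicAdd(head, no_of_elements_being_read);
    atomicExch(&lock, 0);                       // release the lock
}
```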