Quick question here: when you say `__threadfence()` does not affect the behavior of OTHER threads, does that mean a `__threadfence()` only makes the CALLING thread wait until its own writes are visible to everyone else, and does not make any other thread wait?
If that is the case, is there a way to implement a race-free scheme in which one group of threads (grpB) needs to read something that another group of threads (grpA) writes? Before starting its read sequence, each thread of grpB spins on a variable (flag) that ONE of the grpA threads sets, so grpB should wait until the writes of grpA are visible.
Does CUDA provide a programming construct that gives us this ability?
It depends on whether your threads are in the same warp (or half-warp; this is architecture-dependent)! To be safe, all your grpA threads need to be in the same half-warp, and likewise all your grpB threads; grpA and grpB threads should not share a warp, but they should be in the same block! Anyway, this is architecture-dependent, and I would not encourage you to think this way!
I would have done a quick, simple reduce to count the threads that have finished writing, to ensure they are all synchronized (especially if they are in different blocks).
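To make the counting idea concrete, here is a minimal sketch, not a definitive implementation. It assumes grpA and grpB are two separate blocks that are both resident on the GPU at the same time (otherwise the spin loop could hang), and the names `handoff`, `run_handoff`, `done` and `kGroupSize` are made up for illustration. Each grpA thread writes its element, calls `__threadfence()`, then bumps a counter with `atomicAdd`; each grpB thread spins on that counter and only reads once every writer has been counted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr unsigned int kGroupSize = 64;  // threads per group (illustrative value)

// Block 0 plays grpA (writers), block 1 plays grpB (readers).
__global__ void handoff(unsigned int *data, unsigned int *done, unsigned int *out)
{
    const unsigned int t = threadIdx.x;
    if (blockIdx.x == 0) {
        data[t] = 100u + t;   // grpA: write the payload
        __threadfence();      // order the payload write before the counter update
        atomicAdd(done, 1u);  // count this thread as finished writing
    } else {
        // grpB: spin until every grpA thread has been counted.
        volatile unsigned int *counter = done;
        while (*counter < kGroupSize) { }
        __threadfence();      // order the counter read before the payload read
        // Read an element written by a different thread (volatile avoids a cached copy).
        const volatile unsigned int *src = data;
        out[t] = src[kGroupSize - 1u - t];
    }
}

// Hypothetical host helper: returns the number of wrong elements, or -1 on a CUDA error.
int run_handoff()
{
    unsigned int *data = nullptr, *done = nullptr, *out = nullptr;
    const size_t bytes = kGroupSize * sizeof(unsigned int);
    if (cudaMalloc(&data, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&out, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&done, sizeof(unsigned int)) != cudaSuccess) return -1;
    cudaMemset(data, 0, bytes);
    cudaMemset(out, 0, bytes);
    cudaMemset(done, 0, sizeof(unsigned int));

    // Two small blocks, so both fit on the device together.
    handoff<<<2, kGroupSize>>>(data, done, out);

    std::vector<unsigned int> host(kGroupSize);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(host.data(), out, bytes, cudaMemcpyDeviceToHost);

    int errors = 0;
    if (err != cudaSuccess) {
        errors = -1;
    } else {
        for (unsigned int i = 0; i < kGroupSize; ++i)
            if (host[i] != 100u + (kGroupSize - 1u - i)) ++errors;
    }
    cudaFree(data);
    cudaFree(out);
    cudaFree(done);
    return errors;
}
```

Calling `run_handoff()` from `main` and checking that it returns 0 is enough to try this out. Note that the fence only orders the calling thread's own memory operations; it is the spin on the counter that actually makes grpB wait. If the two groups live in the same block, `__syncthreads()` is the simpler tool.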