Simple question: do I understand correctly that a __threadfence_block() call is enough to make sure that changes made by a thread to shared memory are visible to all other threads in the same warp? And that there is no need to mark the relevant shared memory locations as volatile?
This is the behaviour I seem to observe empirically, but I wanted to make sure it is actually guaranteed, as I could not find definitive confirmation in the docs. Roughly, the pattern I have in mind is sketched below.
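A minimal sketch of the pattern (the kernel, names, and values here are made up purely for illustration, and it assumes blocks of a single 32-thread warp):

```cpp
__global__ void probe(int *out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x % 32;

    buf[lane] = lane + 1;        // each lane publishes a value to shared memory
    __threadfence_block();       // is this enough for the rest of the warp to see it?

    out[threadIdx.x] = buf[(lane + 1) % 32];  // read a neighbouring lane's slot
}
```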
It’s not a sync point, so my answer is: no, not sufficient. The threadfence functions affect the ordering of visibility, but they make no guarantees about the specific points at which that visibility is provided, since there is no implied synchronization.
You should use the syncthreads variants for this purpose.
Yes, I am aware that threadfence is not a sync point, but keep in mind that my use case is about communication between threads in the same warp only. As far as I know, threads in the same warp always run in lockstep.
My understanding is that threadfence causes the thread modifying shared memory to actually commit those changes to memory. While such a commit does not guarantee immediate visibility to all other threads, my guess is that it does guarantee visibility to threads in the same warp, since all threads in a warp share the L1 cache.
Of course, the statement above is only a guess. If it is false, I am wondering what my alternative would be. __syncthreads() does not seem like a good option, because I perform the communication at points that are not reached by all threads in the block.
What about marking the shared memory as volatile in addition to using threadfence? Would that be enough to guarantee visibility of the changes to threads in the same warp? Concretely, I mean a variant like the one below.
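Again only a sketch, with made-up values:

```cpp
__global__ void probe_volatile(int *out)
{
    __shared__ volatile int buf[32];  // shared memory marked volatile
    int lane = threadIdx.x % 32;

    buf[lane] = lane + 1;             // each lane publishes a value
    __threadfence_block();            // plus the block-scoped fence

    out[threadIdx.x] = buf[(lane + 1) % 32];  // is the neighbour's write guaranteed visible here?
}
```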
They don’t. You should discard that notion. It may have been conventional wisdom at some point in CUDA’s history, but current programming best practices indicate that programmers should no longer think this way. I suggest reading that blog I linked. It makes the canonical suggestion (around listings 7 and 8) that, to make intra-warp communication work, you use __syncwarp() (at least). __syncwarp() has all the necessary semantics: synchronization along with visibility guarantees. Warp shuffle is another option; the modern warp shuffle functions have a built-in warp- or mask-level sync. Both options are sketched below.
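These are hedged sketches of both options, not the blog’s exact listings; they assume blocks consisting of exactly one full, non-divergent 32-thread warp:

```cpp
// Option 1: intra-warp communication through shared memory, ordered by __syncwarp()
__global__ void warp_sum_shared(const int *in, int *out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x % 32;

    buf[lane] = in[threadIdx.x];
    __syncwarp();                      // sync point + visibility guarantee

    // tree reduction within the warp
    for (int offset = 16; offset > 0; offset /= 2) {
        int v = 0;
        if (lane < offset)
            v = buf[lane] + buf[lane + offset];
        __syncwarp();                  // all reads complete before any write
        if (lane < offset)
            buf[lane] = v;
        __syncwarp();                  // writes visible to the whole warp
    }

    if (lane == 0)
        out[blockIdx.x] = buf[0];      // buf[0] now holds the warp's sum
}

// Option 2: warp shuffle - no shared memory (and no volatile) involved at all
__global__ void warp_sum_shfl(const int *in, int *out)
{
    int v = in[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);  // built-in full-mask sync
    if (threadIdx.x % 32 == 0)
        out[blockIdx.x] = v;
}
```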
I personally don’t know how to do reliable inter-thread communication without any synchronization. Yes, you can use volatile for its defined functionality, but without any thread ordering there is no sense in which we can expect a write at a particular point in one thread to be observed at a particular point in another thread.
When I am teaching CUDA, I often suggest something like this to participants:
“Repeat after me: CUDA makes no kernel thread ordering guarantees, of any kind, except those that you explicitly enforce via source code.”
If your code requires thread ordering for correctness, and you don’t explicitly provide for it in your kernel source code, then your code is broken, by definition, regardless of what results it produces.
Thank you for your replies - they are very informative.
It seems that syncwarp will perfectly satisfy all my needs - I didn’t even know such a call existed - thank you!
Do I get it right that __syncwarp() will make all shared memory changes visible to all threads in the warp, with no need to use the volatile qualifier?
Yes. As shown in the blog I linked, there is typically no need to use volatile when using __syncwarp() for intra-warp communication. For example, something like the sketch below is sufficient.
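A minimal sketch, again assuming one full warp per block:

```cpp
__global__ void exchange(int *out)
{
    __shared__ int buf[32];          // plain shared memory, no volatile qualifier
    int lane = threadIdx.x % 32;

    buf[lane] = lane * 10;           // each lane writes its own slot
    __syncwarp();                    // synchronization plus visibility guarantee

    out[threadIdx.x] = buf[31 - lane];  // reliably read another lane's write
}
```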