Simple question: do I understand correctly that a __threadfence_block() call is enough to make sure that changes made by a thread to shared memory are visible to all other threads in the same warp? And that there is no need to mark the relevant shared memory locations as volatile?
This is the behaviour I seem to observe empirically, but I wanted to make sure it is actually guaranteed, as I could not find definitive confirmation in the docs. Roughly, the pattern I have in mind is sketched below.
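A minimal sketch of the pattern (the kernel, names, and values here are made up purely for illustration, and it assumes blocks of a single 32-thread warp):

```cpp
__global__ void probe(int *out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x % 32;

    buf[lane] = lane + 1;        // each lane publishes a value to shared memory
    __threadfence_block();       // is this enough for the rest of the warp to see it?

    out[threadIdx.x] = buf[(lane + 1) % 32];  // read a neighbouring lane's slot
}
```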
It’s not a sync point, so my answer is: no, not sufficient. The threadfence functions affect the ordering of visibility, but they make no guarantees about the specific points at which that visibility is provided, since there is no implied synchronization.
You should use the syncthreads variants for this purpose.
Yes, I am aware that threadfence is not a sync point, but keep in mind that my use case is about communication between threads in the same warp only. As far as I know, threads in the same warp always run in lockstep.
My understanding is that threadfence causes the thread modifying shared memory to actually commit those changes to memory. While such a commit does not guarantee immediate visibility to all other threads, my guess is that it does guarantee visibility to threads in the same warp, since all threads in a warp share the L1 cache.
Of course, the statement above is only a guess. If it is false, I am wondering what my alternative would be. __syncthreads() does not seem like a good option, because I perform the communication at points that are not reached by all threads in the block.
What about marking the shared memory as volatile in addition to using threadfence? Would that be enough to guarantee visibility of the changes to threads in the same warp? Concretely, I mean a variant like the one below.
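Again only a sketch, with made-up values:

```cpp
__global__ void probe_volatile(int *out)
{
    __shared__ volatile int buf[32];  // shared memory marked volatile
    int lane = threadIdx.x % 32;

    buf[lane] = lane + 1;             // each lane publishes a value
    __threadfence_block();            // plus the block-scoped fence

    out[threadIdx.x] = buf[(lane + 1) % 32];  // is the neighbour's write guaranteed visible here?
}
```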
They don’t. You should discard that notion. It may have been conventional wisdom at some point in CUDA’s history, but current programming best practices indicate that programmers should no longer think this way. I suggest reading that blog I linked. It makes the canonical suggestion (around listings 7 and 8) that, to make intra-warp communication work, you use __syncwarp() (at least). __syncwarp() has all the necessary semantics: synchronization along with visibility guarantees. Warp shuffle is another option; the modern warp shuffle functions have a built-in warp- or mask-level sync. Both options are sketched below.
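These are hedged sketches of both options, not the blog’s exact listings; they assume blocks consisting of exactly one full, non-divergent 32-thread warp:

```cpp
// Option 1: intra-warp communication through shared memory, ordered by __syncwarp()
__global__ void warp_sum_shared(const int *in, int *out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x % 32;

    buf[lane] = in[threadIdx.x];
    __syncwarp();                      // sync point + visibility guarantee

    // tree reduction within the warp
    for (int offset = 16; offset > 0; offset /= 2) {
        int v = 0;
        if (lane < offset)
            v = buf[lane] + buf[lane + offset];
        __syncwarp();                  // all reads complete before any write
        if (lane < offset)
            buf[lane] = v;
        __syncwarp();                  // writes visible to the whole warp
    }

    if (lane == 0)
        out[blockIdx.x] = buf[0];      // buf[0] now holds the warp's sum
}

// Option 2: warp shuffle - no shared memory (and no volatile) involved at all
__global__ void warp_sum_shfl(const int *in, int *out)
{
    int v = in[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);  // built-in full-mask sync
    if (threadIdx.x % 32 == 0)
        out[blockIdx.x] = v;
}
```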
I personally don’t know how to do reliable inter-thread communication without any synchronization. Yes, you can use volatile for its defined functionality, but without any thread ordering there is no sense in which we can expect a write at a particular point in one thread to be observed at a particular point in another thread.
When I am teaching CUDA, I often suggest something like this to participants:
“Repeat after me: CUDA makes no kernel thread ordering guarantees, of any kind, except those that you explicitly enforce via source code.”
If your code requires thread ordering for correctness, and you don’t explicitly provide for it in your kernel source code, then your code is broken, by definition, regardless of what results it produces.
Thank you for your replies - they are very informative.
It seems that syncwarp will perfectly satisfy all my needs - I didn’t even know such a call existed - thank you!
Do I get it right that __syncwarp() will make all shared memory changes visible to all threads in the warp, with no need to use the volatile qualifier?
Yes. As shown in the blog I linked, there is typically no need to use volatile when using __syncwarp() for intra-warp communication. For example, something like the sketch below is sufficient.
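A minimal sketch, again assuming one full warp per block:

```cpp
__global__ void exchange(int *out)
{
    __shared__ int buf[32];          // plain shared memory, no volatile qualifier
    int lane = threadIdx.x % 32;

    buf[lane] = lane * 10;           // each lane writes its own slot
    __syncwarp();                    // synchronization plus visibility guarantee

    out[threadIdx.x] = buf[31 - lane];  // reliably read another lane's write
}
```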