Every thread adds to the same __shared__ memory location at once?

I have noticed that I can write code like this:

__shared__ int V;

// The following is executed by precisely one warp, not an entire block,
// and warptx is the thread index within the warp.
if (warptx == 0) {
  V = 5;
}
V += 8;

Coming out of that, V is going to have the value 13, but I think what just happened is that every thread in the warp read V as 5, added 8 to it, and then committed 13 back: 32 reads of the same shared location followed by 32 writes. It WORKS because of warp-synchronous programming, but it doesn't seem efficient. Am I wrong? Does the compiler see that and realize "oh, hey, this idiot is only trying to add 8 to this piece of shared memory once, so I'll kindly put if (warptx == 0) { … } around it to reduce the number of pulls on the one shared memory bank"?
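
In other words, the transformation I'm hoping the compiler performs would be something like this (just a sketch of what I mean; warptx is the lane index, as above):

__shared__ int V;

if (warptx == 0) {
  V = 5;
  V += 8;   // only one lane ever touches the shared location
}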

Thanks!

First, V += 8 compiles to a sequence of operations:

reg1 := mem1
reg1 += 8
mem1 := reg1

The first operation is a sort of broadcast (just discussed). The second one is lane-local (well, the constant 8 is loaded from constant memory and also broadcast).

Only the last operation is really interesting. Officially, "all lanes perform the write and one value is arbitrarily chosen". In real hardware it's either the first or the last lane, I don't remember which. Some programs even rely on this implementation detail, although of course that's very bad practice.
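
You can see the "one value is kept" behavior with a tiny test like this (just a sketch; which lane wins is an implementation detail you should not rely on):

__global__ void whoWins(int *out) {
  __shared__ int V;
  int lane = threadIdx.x % 32;   // lane index within the warp

  V = lane;                      // every lane stores a different value to the same location
  __syncwarp();                  // memory ordering within the warp (needs CUDA 9+)

  if (lane == 0) {
    out[0] = V;                  // exactly one of the 32 values survived
  }
}

Launch it as whoWins<<<1, 32>>>(d_out) and you will always get back a single lane's value.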

Forgot to say: AFAIK, your code example may not work the way you expect. The reason is that officially each thread is independent, and program optimization relies on that computation model. So it could be optimized, e.g., to

if (warptx == 0) V = 13;
else V = 42; /* garbage in, garbage out */

To ensure that it compiles to what you mean, you can use one of:

  1. "volatile __shared__ int V;" - the volatile qualifier forces a sort of memory barrier around each access to the variable
  2. the thread barrier that is part of a __syncthreads() call
  3. a memory fence in the form of __threadfence_block()
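
For example, option 2 applied to your snippet would look roughly like this (just a sketch; I also let a single lane do the read-modify-write so there is no race on V at all):

__global__ void example(int *out) {
  __shared__ int V;
  int warptx = threadIdx.x % 32;   // lane index, as in your post

  if (warptx == 0) {
    V = 5;
    V += 8;                        // one lane performs the read-modify-write
  }
  __syncthreads();                 // option 2: the barrier makes V == 13 visible to every thread

  if (threadIdx.x == 0) {
    out[0] = V;                    // 13
  }
}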

Your explanation is much appreciated. I tested the code with the __shfl() broadcasting that I had mentioned, and indeed it ran slower than before. The code I posted was just illustrative, so as long as it got the point across there's no need to debug it, but thanks in any case for the other points you raised.
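
For the record, the broadcast variant I tried was along these lines (just a sketch, not my actual code; __shfl() takes the source lane as its second argument, and newer CUDA versions spell it __shfl_sync() with a participation mask):

int v = 0;
if (warptx == 0) {
  v = 5 + 8;      // compute in a register on lane 0 only
}
v = __shfl(v, 0); // broadcast lane 0's value to every lane in the warp, no shared memory involved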