Every thread adds to the same __shared__ memory location at once?

I have noticed that I can write code like this:

__shared__ int V;

// The following is executed by precisely one warp, not an entire block,
// and warptx is the thread index within the warp.
if (warptx == 0) {
  V = 5;
}
V += 8;

Coming out of that, V is going to have the value 13, but I think what just happened is that every thread in the warp read V as 5, added 8 to it, and then committed 13 back: 32 reads of the same shared location followed by 32 writes. It WORKS because of warp-synchronous programming, but it doesn't seem efficient. Am I wrong? Does the compiler see that and realize "oh, hey, this idiot is only trying to add 8 to this piece of shared memory once, so I'll kindly put if (warptx == 0) { … } around it to reduce the number of pulls on the one shared memory bank"?
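
In other words, the transformation I'm hoping the compiler performs would be something like this (just a sketch of what I mean; warptx is the lane index, as above):

__shared__ int V;

if (warptx == 0) {
  V = 5;
  V += 8;   // only one lane ever touches the shared location
}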

Thanks!

First, V += 8 compiles to a sequence of operations:

reg1 := mem1
reg1 += 8
mem1 := reg1

The first operation is a sort of broadcast (just discussed). The second one is lane-local (well, the constant 8 is loaded from constant memory and also broadcast).

Only the last operation is really interesting. Officially, "all lanes perform the write and one value is arbitrarily chosen". In real hardware it's either the first or the last lane, I don't remember which. Some programs even rely on this implementation detail, although of course that's very bad practice.
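
You can see the "one value is kept" behavior with a tiny test like this (just a sketch; which lane wins is an implementation detail you should not rely on):

__global__ void whoWins(int *out) {
  __shared__ int V;
  int lane = threadIdx.x % 32;   // lane index within the warp

  V = lane;                      // every lane stores a different value to the same location
  __syncwarp();                  // memory ordering within the warp (needs CUDA 9+)

  if (lane == 0) {
    out[0] = V;                  // exactly one of the 32 values survived
  }
}

Launch it as whoWins<<<1, 32>>>(d_out) and you will always get back a single lane's value.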

Forgot to say: AFAIK, your code example may not work the way you expect. The reason is that officially each thread is independent, and program optimization relies on that computation model. So it could be optimized, e.g., to

if (warptx == 0) V = 13;
else V = 42; /* garbage in, garbage out */

To ensure that it compiles to what you mean, you can use one of:

  1. "volatile __shared__ int V;" - the volatile qualifier forces a sort of memory barrier around each access to the variable
  2. the thread barrier that is part of a __syncthreads() call
  3. a memory fence in the form of __threadfence_block()
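
For example, option 2 applied to your snippet would look roughly like this (just a sketch; I also let a single lane do the read-modify-write so there is no race on V at all):

__global__ void example(int *out) {
  __shared__ int V;
  int warptx = threadIdx.x % 32;   // lane index, as in your post

  if (warptx == 0) {
    V = 5;
    V += 8;                        // one lane performs the read-modify-write
  }
  __syncthreads();                 // option 2: the barrier makes V == 13 visible to every thread

  if (threadIdx.x == 0) {
    out[0] = V;                    // 13
  }
}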

Your explanation is much appreciated. I tested the code with the __shfl() broadcasting that I had mentioned, and indeed it ran slower than before. The code I posted was just illustrative, so as long as it got the point across there's no need to debug it, but thanks in any case for the other points you raised.
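
For the record, the broadcast variant I tried was along these lines (just a sketch, not my actual code; __shfl() takes the source lane as its second argument, and newer CUDA versions spell it __shfl_sync() with a participation mask):

int v = 0;
if (warptx == 0) {
  v = 5 + 8;      // compute in a register on lane 0 only
}
v = __shfl(v, 0); // broadcast lane 0's value to every lane in the warp, no shared memory involved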