Does __shfl_*() contain an implicit sync?

Hi, I am currently reading material about faster reduction algorithms using shuffle. Super interesting material. What I am curious about is the following: reading the documentation, I see there is a third argument which defaults to the warp size, meaning all the threads in the warp will participate in the shuffle. Does this mean there is an implicit synchronization there?

Let's give an example: before the shuffle I might have an if statement that makes my threads diverge a little. If I need all the threads in a warp to participate, does it mean the faster ones will need to wait for the divergent threads?

This is not causing me any trouble; I just wish to understand the specifics of the __shfl instructions.

Finally, the docs mention active and inactive threads. What does that mean exactly? I thought I could have active and inactive warps, but can it happen that some threads are inactive inside a warp?
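For reference, this is roughly the warp-reduction pattern I am reading about (my own sketch; warpReduceSum is just my name for it):

__inline__ __device__ int warpReduceSum(int val)
{
    // each step pulls a partial sum from the lane `offset` positions
    // above, halving the number of live partial sums every iteration
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;  // lane 0 ends up with the sum of the whole warp
}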

Best Regards

M.

There is no synchronization involved in a shuffle. It does not wait for threads that follow a different control flow (the currently “inactive” threads); the values read from those lanes are simply undefined.

Thanks a lot! It makes sense. So basically “inactive” threads are threads that did not reach that point at that given time due to different control flow? Something like this:

if (value % 2 == 0)
{
  doSomething();
}

__shfl_xor(...);

This snippet will yield undefined behavior, correct? Because I have some divergent threads.
On the other hand, something like this should be perfectly safe (although more expensive)?

if (value % 2 == 0)
{
  doSomething();
}
__syncthreads();
__shfl_xor(...);

You’re asking good questions and follow-ups. I hate to make this even more complicated, but there are multiple levels of answers. Your second code snippet is indeed the correct, safest method. But as you imply, the full __syncthreads() seems needlessly powerful and perhaps even unnecessary. That’s because every current NVIDIA GPU keeps all 32 threads of a warp in lockstep already, so there’s no need to sync when your dependencies are only warp-level. There are thousands of programs and libraries that depend on this behavior, and even NVIDIA’s own examples used and recommended it.

But note I said “used” and not “use”, because the Official Word from NVIDIA is that to future-proof your code, you should no longer depend on such implicit warp synchronization. This hints that future GPUs may use a different architecture that breaks this behavior. So to be future-proof today, you do indeed want to use the full-power __syncthreads() to be completely safe.

But that is super inelegant. NVIDIA realizes that too, and future versions of CUDA will have finer-grained syncing methods which are both safe and (when possible) cost-free. This NVIDIA GTC presentation talks about this exact problem (and the upcoming solution) on pages 32-37.
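To give a taste of where this is heading, the finer-grained primitives look roughly like this (the names come from early CUDA 9 material, so treat this as a sketch, not gospel):

__device__ int butterflyFutureProof(int val)
{
    const unsigned mask = 0xffffffff;      // all 32 lanes participate
    __syncwarp(mask);                      // warp-scope barrier, far cheaper
                                           // than a block-wide __syncthreads()
    return __shfl_xor_sync(mask, val, 1);  // the _sync shuffle takes the
                                           // participating-lane mask first
}

Note that the _sync shuffle already synchronizes the lanes named in its mask, so the separate __syncwarp() would really only be needed when the dependency goes through shared memory instead.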

In practice, my own GPU code still uses the implied warp synchronization everywhere. I try to COMMENT where it may run into problems in the future, but don’t actually use the __syncthreads().

When it comes to shuffle operations, I think the __syncthreads() can definitely be omitted. If the warp size were to change in the future, the shuffle code would break anyway.

When it comes to using shared memory instead, I agree that it might be good style to use __syncthreads() in order to keep the code future-proof.

Back to the original question, I think the important misunderstanding is that diverging threads within one warp do not “wait for” the other ones to arrive. They all see the same instruction at the same time, but some of them may be inactive. As long as the shuffle is outside the conditional branch, everything should be fine.
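To make that concrete (the function name and the *3 work are just placeholders):

__device__ int exchangeAfterBranch(int value)
{
    if (value % 2 == 0)
        value *= 3;               // odd lanes are inactive in here
    // reconvergence point: all 32 lanes are active again, so every
    // lane contributes a well-defined value to the shuffle
    return __shfl_xor(value, 1);  // swap values with the neighboring lane
}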

Thank you for your replies guys, but I am still a little confused.
@audax
I understand that inside the same warp each thread executes the same instruction, but if I have some threads going into a branch and some not, what is happening there?
The threads that take the branch, as in my example, start executing the code of doSomething(). What about the threads that don’t take the branch? Do they execute no-op instructions until the others exit the branch? Or do the threads that don’t take the branch go straight to the __shfl_xor, while the threads inside the branch, still executing doSomething(), appear as inactive to the __shfl instruction?

Do you mind clarifying that behavior a bit? (I think the term “inactive” here is throwing me off a little.)

giordi, your intuition is correct about divergence within a warp. Conceptually and abstractly, you can think of some threads taking one branch and others taking a different branch. In actual implementation detail, all 32 ALUs participate in both branches, but the threads “not supposed to be in this branch” are turned into no-ops on a per-lane, per-instruction basis by the use of the predicate bitmask. (And even that is an abstraction. I believe the real behavior is that all threads participate and compute, but the predicate prevents the results from being written back into the destination register, making it a no-op.)

So an “inactive thread” in a warp is just one which has been predicated to be a no-op at instruction-level granularity. Warp-level divergence (where EVERY thread in the warp is skipping over some code) might be predicated (especially if it’s just an instruction or two) or just handled by actual control flow (i.e., the warp’s instruction stream pointer is updated to jump past the code, like a branch on a CPU).
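You can even watch the predicate mask at work from CUDA C with __ballot(), which returns one bit per currently active lane. A small illustration, assuming the kernel is launched with a single full warp of 32 threads (the kernel name is mine):

__global__ void showActiveMask(unsigned *out)
{
    unsigned before = __ballot(1);      // all 32 lanes active: 0xffffffff
    if (threadIdx.x % 2 == 0)
    {
        unsigned inside = __ballot(1);  // only even lanes: 0x55555555
        if (threadIdx.x == 0)
        {
            out[0] = before;
            out[1] = inside;
        }
    }
}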

You don’t need to know about this implementation detail of predicate masks at the CUDA C level, which is why it is only exposed and discussed in the lower-level PTX documentation.

Intel has added very similar predicate masks in the new AVX-512 instruction set, since they’re so useful for SIMT-style code.

Thank you guys, I mean it, you helped me tons. All your information, plus this video (starting at 20 min):
https://www.youtube.com/watch?v=KHa-OSrZPGo&feature=youtu.be&t=1194

Really helped me nail down the problem. In the end, the warp is nothing more than the width of the SIMD register; a thread (in reality a lane) is nothing more than a “software” construct to give the programmer a bit of flexibility. It all boils down to a single massive register operation: if you hit a branch, you are just masking out part of the register (aka making some threads inactive), etc. This really helped me understand the missing piece of the puzzle and clarify the shuffle instruction.
Cheers

M.

@SPWorley, yes, I saw the PTX stuff in the Kepler shuffle talk:
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-kepler-shuffle/
They show a hand-written PTX instruction that leverages instruction predication to get another extra boost. Super interesting stuff that I plan to dive into in the near future.
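From skimming the PTX ISA, the predicated form looks roughly like this. This is my own unverified sketch (helper name mine), so take it with a grain of salt:

__device__ __forceinline__ int shflIdxPredicated(int x, int srcLane)
{
    int result;
    // shfl.idx.b32 d|p, a, b, c: the predicate p is set only when the
    // source lane was valid and active; zero the result otherwise
    asm volatile("{\n\t"
                 ".reg .pred p;\n\t"
                 "shfl.idx.b32 %0|p, %1, %2, 0x1f;\n\t"
                 "@!p mov.b32 %0, 0;\n\t"
                 "}"
                 : "=r"(result)
                 : "r"(x), "r"(srcLane));
    return result;
}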