Hi, I am currently reading the material about faster reduction algorithms using shuffle, which is super interesting. What I am curious about is the following: reading the documentation, I see that there is a third argument (the width) which defaults to warpSize, meaning all the threads in the warp will participate in the shuffle. Does this mean there is an implicit synchronization there?
Let me give an example: before the shuffle I might have an if statement that makes my threads diverge a little. If I need all the threads in a warp to participate, it means the faster threads will need to wait for the
This is not giving me any trouble; I just wish to understand the specifics of the __shfl instructions.
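To make the scenario concrete, here is a minimal sketch of the pattern I have in mind. It is only an illustration under some assumptions: I am assuming CUDA 9 or later, so it uses __shfl_down_sync (where the participation mask comes first and the width is the last argument) rather than the older __shfl_down, and the kernel and helper names are made up for this post.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: a single warp whose lanes diverge briefly,
// after which every lane takes part in a shuffle-based sum.
__global__ void warpSumAfterDivergence(int *out)
{
    int lane = threadIdx.x % warpSize;
    int val = lane;

    // Divergent branch: only the odd lanes execute this.
    if (lane % 2 == 1) {
        val *= 2;
    }

    // Tree reduction across the warp. The mask 0xffffffff names all
    // 32 lanes as participants; the last argument is the width, written
    // out explicitly here although it defaults to warpSize.
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffffu, val, offset, warpSize);
    }

    // Lane 0 ends up holding the sum over the whole warp.
    if (lane == 0) {
        *out = val;
    }
}

// Hypothetical host helper: launches exactly one warp and returns
// the value written by lane 0 (error checking omitted for brevity).
int runWarpSumDemo()
{
    int *d_out = nullptr;
    int h_out = -1;
    cudaMalloc(&d_out, sizeof(int));
    warpSumAfterDivergence<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

The question is what happens between the if block and the first shuffle: the odd lanes have done extra work, yet the reduction needs a value from every lane.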
Finally, the doc mentions active and inactive threads. What does that mean exactly? I thought only whole warps could be active or inactive, but can it happen that some threads are inactive inside a warp?