In the following code, what is lane 0’s level of participation in the shfl:
float someVar;
someVar = __shfl_up_sync(0xfffffffe, someVar, 1u);//Copy from a lane with lower ID relative to caller
- Source only
- Sink only
- No participation at all
Can I avoid if (0 == laneId) by using the mask somehow in the following, i.e. laneId 0 should be a source but not a sink in this transaction?
float someOtherVar;
float myVar = 0.0f;
myVar = __shfl_up_sync(0xffffffff, someOtherVar, 1u);//Copy from a lane with lower ID relative to caller
if (0 == laneId)
    myVar = 0.0f;
You don’t have source/sink control via the mask. If you want a lane to be able to be used as a source lane, it must be named in the mask. That means it may also act as a sink.
If you read the docs you might conclude “aha! it only needs to be active (in order to act as a source lane), it does not need to be named in the mask.”
According to this blog, that would be an incorrect interpretation. The mask serves to guarantee a particular level of convergence. Without that, previous divergence could break things (i.e. without it, there is no guarantee the thread will actively participate).
“aha! I have no previous divergence!”
Also not a valid assumption, according to the blog:
“It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.”
This is explained in the CUDA programming guide:
“The source lane index will not wrap around the value of width, so effectively the lower delta lanes will be unchanged.”
One has to be careful about ‘unchanged’: the lower delta lanes act as their own source. So in your code lane 0’s myVar does not keep its old value; it is assigned a new value, namely lane 0’s own someOtherVar!
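To make the ‘unchanged’ behaviour concrete, here is a minimal sketch. It assumes a single full warp of 32 threads in a 1D block; the kernel name shiftUpDemo and the host helper runShiftUpDemo are made up for illustration. Every lane passes in 100 + laneId and stores whatever the shuffle returns.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Every lane contributes (100 + laneId) and stores the value returned by the shuffle.
// Assumes the block is exactly one full warp, so all 32 lanes named in the mask call it.
__global__ void shiftUpDemo(float *out)
{
    const unsigned laneId = threadIdx.x & 31u;
    const float mine = 100.0f + static_cast<float>(laneId);
    // Lanes 1..31 read from lane (laneId - 1); lane 0 has no lower lane
    // and gets its own value back.
    out[threadIdx.x] = __shfl_up_sync(0xffffffffu, mine, 1u);
}

// Hypothetical host helper: launches one warp and returns the 32 results.
std::vector<float> runShiftUpDemo()
{
    std::vector<float> host(32, -1.0f);
    float *dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, host.size() * sizeof(float));
    assert(err == cudaSuccess);
    shiftUpDemo<<<1, 32>>>(dev);
    err = cudaMemcpy(host.data(), dev, host.size() * sizeof(float),
                     cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dev);
    return host;
}
```

Lane 0 ends up with its own 100.0f, lane 1 also gets 100.0f (from lane 0), and lane 31 gets 130.0f (from lane 30). So the result variable in lane 0 is overwritten with lane 0’s own input, not left at whatever it held before.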
I think you can remove the second line in your code initializing with 0.
BTW: You can’t fold the shuffle itself into a ternary operator (e.g. 0 == laneId ? 0.0f : __shfl_up_sync(...)), as the shuffle has to be executed by every lane named in the mask, including lane 0.
From a performance point of view, the if and the assignment are fast; the shuffle is slightly slower, as it uses a contended resource.
So the code may look uglier, but it should not be a performance problem.
You can move it into an inline device function with a return value or a reference output parameter.
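A rough sketch of such a wrapper, under the same assumptions as the question (full warp, mask 0xffffffff, 1D block). The names shflUpOrFill, shiftUpOrFillDemo and runShiftUpOrFillDemo are hypothetical, not part of CUDA. The shuffle is executed unconditionally by all lanes, and only afterwards is lane 0’s result replaced.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical helper: shift `var` up by one lane; lane 0 receives `fill`
// instead of getting its own value back.
__device__ __forceinline__ float shflUpOrFill(float var, float fill)
{
    const unsigned laneId = threadIdx.x & 31u;  // assumes a 1D block
    // All 32 lanes execute the shuffle, so lane 0 still acts as a source.
    const float shifted = __shfl_up_sync(0xffffffffu, var, 1u);
    // Selecting the result afterwards is safe; the shuffle has already run.
    return (0u == laneId) ? fill : shifted;
}

__global__ void shiftUpOrFillDemo(float *out)
{
    const float mine = 100.0f + static_cast<float>(threadIdx.x & 31u);
    out[threadIdx.x] = shflUpOrFill(mine, 0.0f);
}

// Hypothetical host helper: launches one warp and returns the 32 results.
std::vector<float> runShiftUpOrFillDemo()
{
    std::vector<float> host(32, -1.0f);
    float *dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, host.size() * sizeof(float));
    assert(err == cudaSuccess);
    shiftUpOrFillDemo<<<1, 32>>>(dev);
    err = cudaMemcpy(host.data(), dev, host.size() * sizeof(float),
                     cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dev);
    return host;
}
```

At the call site this reduces the original snippet to a single line, myVar = shflUpOrFill(someOtherVar, 0.0f);, while the if (here written as a select) stays hidden inside the helper.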