__shfl_up_sync() mask semantics

Vectorizer · October 26, 2025, 7:57pm

In the following code, what is lane 0’s level of participation in the shuffle?

float someVar;
someVar = __shfl_up_sync(0xfffffffe, someVar, 1u);//Copy from a lane with lower ID relative to caller

Source only
Sink only
No participation at all

Can I avoid the if (0 == laneId) in the following by using the mask somehow? That is, lane 0 should be a source but not a sink in this transaction.

float someOtherVar;
float myVar = 0.0f;
myVar = __shfl_up_sync(0xffffffff, someOtherVar, 1u);//Copy from a lane with lower ID relative to caller
if (0 == laneId)
    myVar = 0.0f;

Robert_Crovella · October 26, 2025, 10:55pm

You don’t have source/sink control via the mask. If you want a lane to be able to be used as a source lane, it must be named in the mask. That means it may also act as a sink.

If you read the docs, you might conclude "aha! It only needs to be active; it does not need to be named in the mask" in order to act as a source lane.

This would be an incorrect interpretation based on this blog. The mask serves to guarantee a particular level of convergence. Without that, previous divergence could break things (i.e. without it, there is no guarantee the thread will actively participate).

“aha! I have no previous divergence!”

Also not a valid statement based on the blog:

It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.

striker159 · October 27, 2025, 6:04am

This is explained in the CUDA programming guide:

The source lane index will not wrap around the value of width, so effectively the lower delta lanes will be unchanged.
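A minimal sketch of what that sentence means in practice, assuming a single warp of 32 threads and a full mask. The kernel name shflUpDemo and the host helper runShflUpDemo are made up for illustration. Each lane contributes its own lane ID, so the result shows directly which lane each one read from.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical demo kernel: every lane shuffles its own lane ID up by one.
__global__ void shflUpDemo(float *out)
{
    const unsigned laneId = threadIdx.x % 32u;
    const float mine = static_cast<float>(laneId);
    // All 32 lanes are named in the mask and all 32 execute the call.
    const float got = __shfl_up_sync(0xffffffffu, mine, 1u);
    out[laneId] = got;
}

// Hypothetical host helper: launches one warp and returns the 32 results.
// Error handling is reduced to a single check to keep the sketch short.
std::vector<float> runShflUpDemo()
{
    std::vector<float> host(32, -1.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, 32 * sizeof(float)) != cudaSuccess)
        return host;
    shflUpDemo<<<1, 32>>>(dev);
    cudaMemcpy(host.data(), dev, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Under those assumptions lane 0 gets back its own input (0), and every lane i > 0 gets the input of lane i - 1. So for delta 1, lane 0 is a sink as well as a source: it reads from itself.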

Curefab · October 27, 2025, 1:43pm

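One has to be careful about ‘unchanged’. As I understand it, the lower lanes act as their own source.

The variable on the left-hand side is still assigned a new value (the lane’s own input to the shuffle); it does not keep whatever it held before!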

Curefab · October 27, 2025, 1:45pm

I think you can remove the second line in your code, the one initializing myVar with 0.

BTW: You can’t fold the remaining code into a ternary operator with the shuffle in one of its branches, as the shuffle has to be executed by lane 0 as well.

From a performance point of view, the if and the assignment are fast; the shuffle is slightly slower as it is a contested resource.

So the code may look uglier, but it should be no performance problem.

You can move it into an inline device function with return value or reference output parameters.
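A sketch of that last suggestion, under some assumptions: the helper name shflUpOrZero, the kernel shiftKernel and the host function runShiftDemo are hypothetical, and a single full warp of 32 threads is assumed. The shuffle is executed unconditionally by every lane; only the selection afterwards depends on the lane ID.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: value from the lane one below, or 0.0f in lane 0.
// Must be called by all 32 lanes of the warp, since the mask names all of them.
__device__ __forceinline__ float shflUpOrZero(float value, unsigned laneId)
{
    // Every lane executes the shuffle, including lane 0.
    const float fromBelow = __shfl_up_sync(0xffffffffu, value, 1u);
    // Selecting after the shuffle is fine; the shuffle call itself is not skipped.
    return (laneId == 0u) ? 0.0f : fromBelow;
}

// Hypothetical kernel showing the call site (one warp per block assumed).
__global__ void shiftKernel(const float *in, float *out)
{
    const unsigned laneId = threadIdx.x % 32u;
    out[threadIdx.x] = shflUpOrZero(in[threadIdx.x], laneId);
}

// Hypothetical host driver: expects exactly 32 input values.
// CUDA error checking is omitted to keep the sketch short.
std::vector<float> runShiftDemo(const std::vector<float> &in)
{
    std::vector<float> out(32, -1.0f);
    float *dIn = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dIn, 32 * sizeof(float));
    cudaMalloc(&dOut, 32 * sizeof(float));
    cudaMemcpy(dIn, in.data(), 32 * sizeof(float), cudaMemcpyHostToDevice);
    shiftKernel<<<1, 32>>>(dIn, dOut);
    cudaMemcpy(out.data(), dOut, 32 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

The ternary here operates on a value that has already been shuffled, which is different from placing the __shfl_up_sync call itself in one arm of the ternary; in that case lane 0 would skip a call that the full mask says it takes part in.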
