Confused about the warp-level mask

I’m a newbie and I read the hello-world sum example below (from Using CUDA Warp-Level Primitives | NVIDIA Technical Blog):

// input.size=32 len=25
__global__ void sum_kernel(int* input, size_t len) {
    unsigned mask = __ballot_sync(0xFFFFFFFF, threadIdx.x < len);
    int val;
    if (threadIdx.x < len) {
        val = input[threadIdx.x];
        for (int offset = 16 ; offset > 0; offset /= 2) {
            val  += __shfl_down_sync(mask, val, offset);
        }
    }
    input[threadIdx.x] = val;
}

What does the mask mean? In my opinion, the code above is wrong.
For example, with input.size=32 and len=25:

  1. Will only threads #0-#24 be synced?
  2. Will the variable val in threads #25-#31 be undefined (UB)?
  3. When offset=16, will the shuffle down in threads #0-#15 return the val from threads #16-#31?
  4. The val in threads #25-#31 is UB, so will the val in threads #9-#15 be UB too?

I also have some questions:

  1. Does the mask apply to the shuffle source thread or the destination thread (with offset=16, does mask bit 0x01 index thread #0 or thread #16)?
  2. If a thread is not set in the mask but still calls shuffle down, is the behaviour of that thread UB?

Thanks for pointing it out. Yes there does seem to be a defect in that code. It’s being discussed internally at NVIDIA.

Shuffle operations only work correctly when both the source and the destination thread are participating. The mask parameter was added to CUDA along with the release of the Volta architecture. Volta introduced more flexible thread scheduling, which means that threads may persist in a warp-diverged state for longer than the programmer may anticipate based on experience with previous architectures. The purpose of the mask parameter is therefore to ensure that threads reconverge, if needed, to support the expected participation in the shuffle operation. Every thread needed or desired as a source or a destination should be called out in the mask.

Note that this “reconvergence” mask does not address or resolve a situation where the code structure prevents a desired thread from participating. For example, if you have an if-condition that prevents certain threads from participating (as the above code does), then it is the programmer’s responsibility to ensure that those threads are not indicated in the mask parameter and are not needed (i.e. they are neither a source nor a destination thread) for the shuffle operation. This particular point manifests itself in the code you excerpted, and it is the crux of my concern, and I believe of yours as well.

If a thread is not indicated as “needed” in the mask, but is needed for the correct/desired operation of the shuffle, then that is UB.
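As a minimal sketch of what a rule-compliant mask can look like (the kernel below is made up for illustration, and it uses __shfl_xor_sync rather than the blog's __shfl_down_sync so that every computed source lane stays inside the participating group):

// Only lanes 0-15 execute the shuffle, the mask names exactly those lanes,
// and with __shfl_xor_sync the computed source lane (lane ^ offset) never
// leaves 0-15, so every source and destination both participates and is set
// in the mask. Assumes a single-warp launch and at least 16 elements in data.
__global__ void lower_half_sum(int* data) {
    unsigned lane = threadIdx.x % warpSize;
    if (lane < 16) {
        const unsigned mask = 0x0000FFFFu;   // exactly the participating lanes
        int val = data[lane];
        for (int offset = 8; offset > 0; offset /= 2)
            val += __shfl_xor_sync(mask, val, offset);
        data[lane] = val;                    // every lane 0-15 now holds the sum of data[0..15]
    }
}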

Hi selah,
my understanding of the code in the case you mention (size=32, len=25) was that

  • at line 3 (the __ballot_sync call), “mask” will have only bits 0-24 set (the threads for which threadIdx.x < len)
  • at line 5, only threads 0-24 will enter the “if”
  • at line 6, “val” will contain the corresponding input element
  • inside the for loop at line 7, each thread’s “val” is accumulated into the (current - offset) lane

IMHO the mask and the execution path coherently indicate that the operation works on all and only threads 0-24.

So, after each iteration, the value of “val” in each thread will accumulate the values from the threads that are 16, 8, 4, 2, 1 lanes after it.
Before the loop (here I write “N” to mean “input[N]” and “##” for unused values):

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	##	##	##	##	##	##	##

After the first iteration (offset = 16):

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	##	##	##	##	##	##	##
+16	+17	+18	+19	+20	+21	+22	+23	+24

After the second iteration (offset = 8):

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	##	##	##	##	##	##	##
+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+16	+17	+18	+19	+20	+21	+22	+23	+24
+24

After the third iteration (offset = 4):

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	##	##	##	##	##	##	##
+4	+5	+6	+7	+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+16	+17	+18	+19	+20	+21	+22	+23	+24
+20	+21	+22	+23	+24
+24

After the fourth iteration (offset = 2):

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	##	##	##	##	##	##	##
+2	+3	+4	+5	+6	+7	+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+4	+5	+6	+7	+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+6	+7	+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+16	+17	+18	+19	+20	+21	+22	+23	+24
+18	+19	+20	+21	+22	+23	+24
+20	+21	+22	+23	+24
+22	+23	+24
+24

After the last iteration (offset = 1):

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	##	##	##	##	##	##	##
+1	+2	+3	+4	+5	+6	+7	+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+2	+3	+4	+5	+6	+7	+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+3	+4	+5	+6	+7	+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+4	+5	+6	+7	+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+5	+6	+7	+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+6	+7	+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+7	+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+8	+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+9	+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+10	+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+11	+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+12	+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+13	+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+14	+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+15	+16	+17	+18	+19	+20	+21	+22	+23	+24
+16	+17	+18	+19	+20	+21	+22	+23	+24
+17	+18	+19	+20	+21	+22	+23	+24
+18	+19	+20	+21	+22	+23	+24
+19	+20	+21	+22	+23	+24
+20	+21	+22	+23	+24
+21	+22	+23	+24
+22	+23	+24
+23	+24
+24

So, after the last iteration, “val” for thread 0 contains the sum of all the elements, etc.

selah, Robert, what did I misunderstand?

Here’s what I stated above:

“shuffle operations only work correctly when both source and destination thread are participating”

This statement is supported in the programming guide if you read the section on warp shuffle.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-notes

For example:

“Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.” Note that “__shfl_sync() command” is used here to refer collectively to the shuffle operations; that is evident from a careful read of the first sentence of section B.15.2:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-description

Taking the case you identify, where the first 25 threads (0-24) are participating (i.e. selected by the if-condition), let’s refer to threads by warp lane, i.e. 0-31.

Threads 0-24 are the first 25 threads in the warp, selected by the if-condition to participate in the if-body, which includes the warp shuffle operation __shfl_down_sync.

That operation takes an offset parameter that defines the source lane for the shuffle: the offset is added to the destination thread’s warp lane to determine the source thread’s warp lane. For thread 9, on the first pass through the for-loop, the offset is 16, and 16 + 9 is 25. This means warp lane 25 is the source thread for the val to be retrieved by destination thread 9. But lane 25 is not participating (only 0-24 are participating). Therefore that operation exercises UB, and the val retrieved for thread 9 is undefined.
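To make that arithmetic concrete, here is a tiny stand-alone sketch (the kernel name and output buffer are made up; launch it with a single warp of 32 threads):

// Every lane passes its own lane index through __shfl_down_sync with offset 16.
// Source lane = destination lane + offset, so lane 9 reads from lane 9 + 16 = 25.
// Because all 32 lanes execute this call with a full mask, lane 25 does
// participate and the read is well defined; in the excerpted kernel it is not.
__global__ void shfl_source_demo(int* out) {
    int lane = threadIdx.x % warpSize;
    int fetched = __shfl_down_sync(0xFFFFFFFF, lane, 16);
    out[lane] = fetched;   // lanes 0-15 receive 16-31; lanes 16-31 keep their own value (source out of range)
}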

I see.
So you are saying that the correct way would be to use “__shfl_up_sync” instead of “__shfl_down_sync”, and accumulate the sum of all the elements into input[24]?

By the way, in the example above, what happens to lanes 16-24? Do they try to read a value from the non-existent lanes 32-40?

Thank you,
.Andrea

I think there are multiple ways to address/correct the issue. Perhaps one way would be to initialize val with an appropriate value for all lanes, and allow all lanes to participate in the warp shuffle. This would assume that the threadblock size is a whole-number multiple of warpSize, which is good CUDA programming practice. For example, remove the if-statement, and initialize val as:

val = (threadIdx.x < len)?input[threadIdx.x]:0;

This also means you could/should probably dispense with the mask calculation, which substantially changes the presented code, essentially taking you back to a basic warp shuffle reduction. Since the authors of the blog were presumably not intending to rehash that, I’m not suggesting this makes sense in the context of what the authors were setting out to accomplish, merely that it would rectify the claimed hazard.
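For concreteness, a possible rewrite along those lines might look like this (just a sketch of the suggestion above, not the blog authors’ fix; it assumes a single warp of 32 threads and that input has at least 32 elements, as in the original example):

// All 32 lanes participate with the full mask; lanes >= len contribute the
// additive identity 0, so the partial-warp mask is no longer needed.
__global__ void sum_kernel_fixed(int* input, size_t len) {
    int val = (threadIdx.x < len) ? input[threadIdx.x] : 0;
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    // Lane 0 now holds the sum of the first len elements; the other lanes
    // hold partial values, as in the original code.
    input[threadIdx.x] = val;
}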

I would encourage you to read the warp shuffle documentation I’ve already linked. The answer is contained there.

In short, with respect to shuffle up and shuffle down, if the computed source lane ID is “out of range” (less than 0 for shuffle up, or greater than 31 for shuffle down) the operation does not “wrap around” in any way, and the destination lane value is unchanged. This is not considered UB. It is legal usage.

I see that I had misunderstood the behaviour of __shfl_down_sync().
If “the destination lane value is unchanged” it means that executing

val += __shfl_down_sync(0xFFFFFFFF, val, 16);

for lanes 0-15, val accumulates the value from lanes 16-31, while for lanes 16-31 it ends up duplicating (doubling) its original value.

The other part I found confusing is why threads for which the computed source is outside of the mask are not treated in the same way, i.e. also left unchanged.

Thank you for the explanations.

Ciao,
.Andrea

For the presented code/example, again assuming participation of threads 0-24, it means that threads 0-8 would pick up the value from lanes 16-24, whereas for threads 9-15 the value picked up is undefined. And this is the problem.

In the case where all threads are participating, and assuming we are using this for a typical warp shuffle sum reduction, then lanes 0-15 would pick up values from lanes 16-31, and lanes 16-31 end up duplicating their original value. This duplication of original value is irrelevant for the warp sum reduction, since after the first iteration of the for-loop, the accumulated values in lanes 16-31 play no part in the composition of the final sum in lane 0.

If a computed source lane is outside the mask (but within the range of 0-31) the behavior is undefined. It is not correct to say “they are left unchanged”. Probably we are in agreement here.

  • both desired destination lanes and computed source lanes must participate
  • both desired destination lanes and computed source lanes must be represented in the mask parameter
  • for shuffle up/down, if the computed source lane is outside the range of 0-31, that is an acceptable case

Yes: what I meant is that I found it confusing that “source outside 0-31” and “source not in the mask” are treated differently.

Thanks,
.Andrea

Robert, thanks for your explanation.

Maybe Robert has already answered your question. However, the code is correct on sm_61, because all code lines will run synchronized (in lockstep).

the key is that:

// input.size=32 len=25
    __global__ void sum_kernel(int* input, size_t len) {
        unsigned mask = __ballot_sync(0xFFFFFFFF, threadIdx.x < len);
        int val; <---- val will be initialized to 0
        if (threadIdx.x < len) {
            val = input[threadIdx.x];
            for (int offset = 16 ; offset > 0; offset /= 2) {
                val  += __shfl_down_sync(mask, val, offset);
            }
        }
        input[threadIdx.x] = val;
    }

The val in all threads (lanes #0-31) will be 0, so it produces a correct result. (In the trace below, each digit is the running value of val in one lane, written modulo 10; the rows marked with ‘+’ show what each lane adds in that iteration.)

NUM_ELEMENTS=25
  11111111 11111111 11111111 10000000
+ 11111111 10000000 11111111 10000000 <- offset=16
  22222222 21111111 22222222 20000000
+ 21111111 22222222 20000000 20000000 <- offset=8
  43333333 43333333 42222222 40000000 
+ 33334333 33334222 22224000 00000000 <- offset=4
  76667666 76667555 64446222 40000000 
+ 66766676 66755564 44622240 00000000 <- offset=2
  32323232 32312019 08068462 40000000 
+ 23232323 23120190 80684624 00000000 <- offset=1
  55555555 55432109 88642086 40000000

As Robert mentioned, if the arch is sm_7x, the threads in a warp may run unsynchronized.
The mask will only guarantee that threads #0-24 are synced; at the sync point, the val in threads #25-31 may be uninitialized, so the result will be UB (I write UB values as ‘-’).

NUM_ELEMENTS=25
  11111111 11111111 11111111 1-------
+ 11111111 1------- 11111111 1------- <- offset=16
  22222222 2------- 22222222 2-------
+ 2------- 22222222 2------- 2------- <- offset=8
  4------- 43333333 4------- 4------- 
+ ----4333 33334--- ----4--- -------- <- offset=4
  ----7666 76667--- ----6--- 4------- 
  ...
  UB

No, the code is not correct, even on sm_61.

Even on sm_61, the warp is not running in a converged state, due to the if-condition.

The premise of this entire thread is that there will be some threads whose index is >= len:

Those threads will not participate, on any architecture.

Sorry, but I’m confused about ‘converged’.
In my opinion, the if statement creates a branch. The SM will run all the threads on one branch concurrently, then the other branch. The code will run like the following list:

// input.size=32 len=25
unsigned mask = __ballot_sync(0xFFFFFFFF, threadIdx.x < len);  # run on #0-#31
int val;                                                       # run on #0-#31, init val
val = input[threadIdx.x];                                      # run on #0-#24
for (int offset = 16 ; offset > 0; offset /= 2) {              # run on #0-#24
   val  += __shfl_down_sync(mask, val, offset);                # run on #0-#24
}

input[threadIdx.x] = val;                                      # run on #0-#24

The register that stores val will be initialized for the whole warp.

I also see you said:

Does this mean that the val in lanes #25-#31 may be changed?

Please re-read my comment #8 above. I said:

For threads 9-15, as written, the val they pick up is undefined. The reason it is undefined is that when you add the first offset value (16) to 9-15, you end up with 25-31, and threads 25-31 are not participating. Your own summary indicates this: