Is it dangerous to mix warp shuffles with bitwise or logical operators in the same expression?

if(warpLaneIndex == 5){
    a = false;
}
bool something = a && b && __shfl_sync(0xFFFFFFFF, c, d); // does warp lane 5 enter into the shuffle instruction?
if(something){
    doSomething();
}

In C++, such expressions are short-circuited if a is false or b is false. So if a or b differs per warp lane, what does CUDA do? Does it still run the shuffle? Or are some threads missing from it?

Also, does the above logical-AND calculation create any warp divergence on some architectures due to the short-circuiting on some warp lanes? Even if the result of the shuffle is always the same for all warp lanes (the only difference coming from a or b)?

Do we always have to put these shuffles at the front of the expression? What happens if there are multiple shuffles?

bool something = __shfl_sync() && __shfl_sync() && __shfl_sync();

CUDA is largely C++. There are no special rules for this case. If C++ short-circuit evaluation results in the __shfl_sync() being executed for some threads in the warp but not others, then you have an instance of illegal coding (probably - unless you have accounted for this elsewhere in your code), because you have explicitly required all threads in the warp to “report” (due to the 0xFFFFFFFF mask) and have specifically prevented that possibility (in so far as what you have shown, anyway).

As indicated above, it seems like conclusions/guidance for various cases can be derived from C++ knowledge/rules as well as the CUDA shuffle requirement that the warp “attendance” match (or “satisfy”) the warp lane mask.


Yes, all threads are meant to enter the shuffle, as signified by the 0xFFFFFFFF mask in the code. I only noticed a performance difference (10x) when the result of a or b varies a lot, and I wanted to be sure whether this is the issue, even though the code was working fine (at least for some test cases).

You can always write

if(warpLaneIndex == 5){
    a = false;
}
bool s = __shfl_sync(0xFFFFFFFF, c, d); // move the shuffle outside the shortcut evaluation
bool something = a && b && s;
if(something){
    doSomething();
}
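
For the multi-shuffle case asked about above, the same hoisting applies. A sketch (c1/d1 and the other arguments are hypothetical placeholders, since the original __shfl_sync() calls had their arguments elided):

bool s1 = __shfl_sync(0xFFFFFFFF, c1, d1); // every lane executes every shuffle
bool s2 = __shfl_sync(0xFFFFFFFF, c2, d2); // unconditionally, before the
bool s3 = __shfl_sync(0xFFFFFFFF, c3, d3); // short-circuit expression
bool something = s1 && s2 && s3; // short-circuiting now only skips cheap register reads
if(something){
    doSomething();
}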

So the compiler does not move it back inside the short-circuit evaluation of the bool something = line? That is good to know.

This may be of interest.

I shall not use warp shuffles in constructors unless absolutely necessary. A copy constructor or assignment operator can change how, and how often, a constructor with a warp shuffle in it gets called.

Even adding a printf in a kernel can change the behavior. For example, if printf causes the kernel to use a different memory space for some temporary variables (with constructors containing warp shuffles…), it may cause a similar issue.
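
As an illustration of the concern (a minimal hypothetical sketch; LaneBroadcast and useBroadcast are invented names):

struct LaneBroadcast {
    int value;
    __device__ explicit LaneBroadcast(int v) {
        // every lane of the warp must reach this call with the full mask
        value = __shfl_sync(0xFFFFFFFF, v, 0);
    }
    // the implicitly generated copy constructor and assignment do not shuffle
};

__device__ int useBroadcast(int v) {
    LaneBroadcast a(v);   // the shuffle executes here
    LaneBroadcast b = a;  // a plain copy: no shuffle; whether temporaries or
                          // copies exist at all is up to the compiler, not the
                          // source text
    return b.value;
}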

I was reading that article I linked with a wider/different focus. Here’s what seems evident to me:

  1. This is about the device code compiler. I assume we are talking about a device code compiler provided by NVIDIA.
  2. If you write C++ (device) code that as written is correct, but under some unspecified optimization becomes “incorrect”, then my view is that is the responsibility of the compiler to “not do that”.
  3. If the compiler will/does do that, it seems (to me) that it could be essentially impossible to write correct code.
  4. If you found a (verified) instance of that, it might make sense to file a bug.

I think your statements about printf are only speculation. There’s no reason to assume that a function call provided by an NVIDIA library that is intended to be usable from a single thread should somehow be unreliable when it’s used from multiple threads. It is another example of something that we could just assume “should not be the case” (because if we assumed it is possible, then it raises the possibility that it is impossible to write correct code) and, if you found an example of it, probably consider filing a report. It’s not possible to write documentation that explicitly excludes every imaginable case. You have to assume that the CUDA developers intend to give you tools that do not make it impossible to write correct code.

Most of the above is my opinion; it’s what seems evident to me, or what I would do. I certainly cannot argue the case about what kind of documentation is possible to create. It’s just a viewpoint.

I guess in theory (C++ standard) the compiler could change it to the seemingly identical

if(warpLaneIndex == 5){
    a = false;
    bool s = __shfl_sync(0xFFFFFFFF, c, d); // move the shuffle outside the shortcut evaluation
    bool something = a && b && s;
    if(something){
        doSomething();
    }
} else {
    bool s = __shfl_sync(0xFFFFFFFF, c, d); // move the shuffle outside the shortcut evaluation
    bool something = a && b && s;
    if(something){
        doSomething();
    }
}

and optimize this to

if(warpLaneIndex == 5){
    __shfl_sync(0xFFFFFFFF, c, d); // edited/readded: shuffle probably has to stay in as it has "side-effects"
} else {
    bool s = __shfl_sync(0xFFFFFFFF, c, d); // move the shuffle outside the shortcut evaluation
    bool something = a && b && s;
    if(something){
        doSomething();
    }
}

But that would lead to an invalid CUDA program. I think NVIDIA tried to prevent those rearrangements or optimizations within nvcc, probably at least whenever a *_sync function is called. Or at least the *_sync functions are always executed and never removed, similar to volatile operations.

Some sync instructions demand (or at least demanded in earlier PTX or architecture versions?) that the very same sync instruction (it was documented as the same PTX instruction, which could - undocumented - mean the same resulting SASS instruction at the same code location) is executed by all participating threads of a warp or block.

I am not sure if that condition can be 100% enforced from arbitrary C++ source code, or if there are extreme cases that lead to faulty compiled kernels.

The base of the compiler was not originally made for CUDA compilation. And in C++ there exists the as-if rule: any equivalent code having the same observable effects can replace the programmed code. That is a dangerous mixture. There have to be theoretical and practical considerations within nvcc to keep it working correctly.


Even if all warp lanes execute a shuffle instruction, but via two different branches, does it still count as the same instruction? I mean, even if the instruction is exactly the same with the exact same parameters, does having it at a different code location (memory address in the instruction cache) make this bad?

What if one of the branches falls outside the instruction cache and has to be fetched from global memory?

That is permissible in Volta+. You may wish to read the article I linked.

A shuffle op works from registers (only). If there is an external data retrieval needed to populate the register, that is separate from the shuffle op, at the PTX or SASS level. The fact that some warp lanes may retrieve their shuffle target (made available to other threads) from memory space X and some warp lanes may retrieve their shuffle target from memory space Y is independent/irrelevant, and at the point of the PTX or SASS instruction, is not evident, anyway.

Volta+ ITS allows for warp fragments to operate and behave independently (although this particular data-retrieval-source possibility would not cause trouble even in the pre-volta case).
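
A small sketch of that point (function and parameter names are illustrative): where x came from does not matter to the shuffle; by the time the SHFL executes, x is just a register operand, and all 32 lanes run the same instruction.

__device__ int shuffleFromMixedSources(const int* gmem, int localVal,
                                        bool fromMemory, int srcLane) {
    int x = fromMemory ? gmem[threadIdx.x]  // some lanes may load their value from memory
                       : localVal;          // others use a value already in a register
    return __shfl_sync(0xFFFFFFFF, x, srcLane); // one shuffle, executed by the whole warp
}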


Then I assume there’s no caching of shuffle parameters & return values (so that it could be skipped altogether for all warp lanes, saving some shuffle throughput for other warps that may need it).

It is related to the .aligned specifier of PTX instructions (here for barrier, similar for other instructions) - from the PTX manual:

Instruction barrier{.cta} has optional .aligned modifier. When specified, it indicates that all threads in CTA will execute the same barrier{.cta} instruction. In conditionally executed code, an aligned barrier{.cta} instruction should only be used if it is known that all threads in CTA evaluate the condition identically, otherwise behavior is undefined.

And it is also - for some instructions - related to the compute capability (as Robert said, pre-Volta vs. Volta+), here for the barrier instruction:

For .target sm_6x or below,

  1. barrier{.cta} instruction without .aligned modifier is equivalent to .aligned variant and has the same restrictions as of .aligned variant.
  2. All threads in warp (except for those have exited) must execute barrier{.cta} instruction in convergence.

Probably not. If you call the identical shuffle instruction twice, it probably is executed twice.

I believe (but do not know) that the C++ optimizer inside nvcc has no notion of the values of variables in other threads (except for generating uniform instructions, which are the same warp-wide), or at least does not follow those values or try to predict them. I think it handles values read from a shuffle like unpredictable user input. A shuffle instruction is probably treated internally like reading an unknown value - on a PC or microcontroller - from hardware, from an independent thread, or from a volatile location. Volatile memory accesses (also) have to be executed each time, and in order with respect to other volatile accesses.

And at the same time even the first shuffle operation (of the two identical ones) could not be elided, although its result would not be used, as the shuffle also makes values available to other threads/lanes.

Something to try out with a case that is obvious to the compiler.
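
For instance, a toy kernel one could compile and inspect (e.g. with cuobjdump -sass) to see whether both seemingly identical shuffles survive - a sketch with invented names:

__global__ void twoIdenticalShuffles(const int* in, int* out) {
    int v = in[threadIdx.x];
    int first  = __shfl_sync(0xFFFFFFFF, v, 0);  // result deliberately unused
    int second = __shfl_sync(0xFFFFFFFF, v, 0);  // identical call, result used
    (void)first;                                 // silence unused-variable warnings
    out[threadIdx.x] = second;
}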


Regarding the example at the beginning:

To keep the optimization for the case in which no lane of the warp needs the shuffle instruction, you can guard the shuffle with a warp vote. Warp vote functions are faster than shuffle instructions and probably not shared between the SM partitions.

if (__any_sync(FULL_MASK, a && b)) {
    bool s = __shfl_sync(0xFFFFFFFF, c, d);

    bool something = a && b && s;
    if(something){
        doSomething();
    }
}

In your example, in which you share booleans, a __ballot_sync is probably the more performant option anyway:

bool s = __ballot_sync(FULL_MASK, c) & (1 << d);
bool something = a && b && s;
if(something){
    doSomething();
}

I have code similar to the following, and the kernel completes 10x quicker, without any error/hang, when the result variable mostly carries bit values that quickly drive the mask variable to 0xFFFFFFFF:

    unsigned int mask = 0;
    for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
        ...
        unsigned int k = __ballot_sync(0xFFFFFFFF, m);
        ...
        unsigned int result = __match_any_sync(0xFFFFFFFF, data);
        ...
        mask = mask | result; // this is a no-op once mask is 0xFFFFFFFF
    }
    output[id] = mask;

When the “mask” variable has all its bits set quickly, the kernel completes 10x faster. When the “mask” variable stays zero or only rarely becomes non-zero, it runs at the slowest timing. The program runs without any error.

It is 10x faster again with:

mask = result | mask;

so the order of the operands does not affect anything here.

Does the same CUDA warp-shuffle thread-attendance rule apply to the loop early-quit optimization too?

The loop becomes something like this:

28	0000000a 00c7cab0	      ISETP.GE.U32.AND P0, PT, R19, 0x10, PT
29	0000000a 00c7cac0	      ISETP.GT.U32.AND P1, PT, R19, 0xf, PT
30	0000000a 00c7cad0	      P2R R0, PR, RZ, 0x2
31	0000000a 00c7cae0	@P0   LDG.E R2, [R16.64+-0x40]
32	0000000a 00c7caf0	      ISETP.NE.AND P0, PT, R25, RZ, PT
33	0000000a 00c7cb00	      SEL R29, R26, 0xffff0000, !P0
34	0000000a 00c7cb10	      **BRA.DIV** ~URZ, 0xa00c7cdf0
35	0000000a 00c7cb20	      ISETP.GT.U32.AND P0, PT, R19, 0xf, PT
36	0000000a 00c7cb30	      **VOTE.ANY** R18, PT, P0
37	0000000a 00c7cb40	      **BRA.DIV** ~URZ, 0xa00c7ce60
38	0000000a 00c7cb50	      **MATCH.ANY** R4, R2
39	0000000a 00c7cb60	      ISETP.GE.U32.AND P0, PT, R19, 0x10, PT
40	0000000a 00c7cb70	      IADD3 R3, R23, 0x1, RZ
41	0000000a 00c7cb80	@P0   LDG.E R2, [R16.64]
42	0000000a 00c7cb90	      ISETP.NE.AND P0, PT, R24, R3, PT
43	0000000a 00c7cba0	      LOP3.LUT R3, R18, R4, R29, 0x80, !PT
44	0000000a 00c7cbb0	      SEL R31, R26, 0xffff0000, !P0
45	0000000a 00c7cbc0	      LOP3.LUT R29, R3, R28, RZ, 0xfc, !PT
46	0000000a 00c7cbd0	      **BRA.DIV** ~URZ, 0xa00c7cec0
47	0000000a 00c7cbe0	      **MATCH.ANY** R4, R2
48	0000000a 00c7cbf0	      ISETP.GT.U32.AND P0, PT, R19, 0xf, PT
49	0000000a 00c7cc00	      **VOTE.ANY** R18, PT, P0
50	0000000a 00c7cc10	      IADD3 R23, R23, 0x2, RZ
51	0000000a 00c7cc20	      IADD3 R16, P1, R16, 0x80, RZ
52	0000000a 00c7cc30	      ISETP.NE.AND P0, PT, R23, 0x186a, PT
53	0000000a 00c7cc40	      LOP3.LUT R4, R18, R4, R31, 0x80, !PT
54	0000000a 00c7cc50	      IMAD.X R17, RZ, RZ, R17, P1
55	0000000a 00c7cc60	      IADD3 R25, R25, 0x2, RZ
56	0000000a 00c7cc70	      LOP3.LUT R28, R4, R29, RZ, 0xfc, !PT
57	0000000a 00c7cc80	@P0   BRA 0xa00c7cab0

There are multiple branch instructions. The last one is the loop branch; the other three BRA.DIV instructions are not from my CUDA code - the compiler adds them. The compiler also adds two more vote/match_any instructions, so each loop iteration is doing something extra with match_any.

Edit:

Since the program ran fine, I was guessing the for loop is not early-quitted, but then I saw the BRA.DIV instructions in the disassembly and started thinking that perhaps the compiler is adding the early quit.

I can’t really tell what you are saying:

So you think there is no early-quit optimization.

In the very next sentence you are saying there probably is one.

Confusing.

I doubt there is an early-quit optimization based on the mask value “inserted” by the compiler. First of all it would surprise me, but more importantly it would break the preceding _sync ops, and I indicated previously in this thread that I think the compiler ought not to insert optimizations that would take correctly written code and turn it into illegal code.

It’s just my impressions. You can certainly identify what the compiler has actually done, with enough study.

I’m sorry for the confusion; I was editing the text while checking the assembly output and mixed up the placement of the conclusions. So, it’s safe to assume a loop is not optimized differently between warp lanes even if the loop body includes a warp shuffle. Thanks.
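
If an explicit early exit based on the accumulated mask were desired, one way to keep it legal would be to make the decision warp-uniform with a vote, so that all 32 lanes leave the loop together and the _sync operations inside keep their full attendance. A sketch, reusing the names from the snippet above:

unsigned int mask = 0;
for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
    // ... per-chunk work producing data ...
    unsigned int result = __match_any_sync(0xFFFFFFFF, data);
    mask |= result;
    if (__all_sync(0xFFFFFFFF, mask == 0xFFFFFFFF))
        break;  // every lane agrees, so the whole warp exits at once
}
output[id] = mask;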
