Significant slowdown in warp-wide reduction operations since CUDA 12.4 for sm_80

Hi,

Since upgrading to CUDA 12.4 (and it seems to continue in CUDA 12.5), I’ve noticed a significant slowdown in warp-wide reduction operations (__ballot_sync, __any_sync, __shfl_sync, etc.) when compiling for the A100 architecture (-arch sm_80). The slowdown seems to be specific to the case where the mask of participating threads is unknown at compile time or doesn’t include all the threads in the warp. For example, the following (dummy) code, which uses (I assume) a common CUDA programming pattern, is slower when compiled with CUDA 12.4+:

  for (uint32_t i = 0; i < size; i += 32) {
    uint32_t offset = i + (threadIdx.x & 0x1f);
    // Mask is known at compile time and it includes the whole warp, so no impact here
    uint32_t mask = __ballot_sync(-1U, offset < size);
    if (offset < size) {
      uint32_t clk = clock();
      // Here mask isn't known at compile time, and the SASS will be different
      if (__any_sync(mask, (clk & 0x1) != 0)) {
        sum += __shfl_sync(mask, clk, threadIdx.x & 0x1f);
      }
    }
  }

Even if size % 32 == 0, i.e. mask == -1U for all iterations, the code runs slower and executes many more instructions.
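
For reference, a minimal self-contained kernel that wraps the loop above looks roughly like this (the kernel name, launch configuration, and output buffer are only illustrative, not my exact benchmark):

  #include <cstdint>

  // Writing 'sum' out keeps the warp-wide operations live so the compiler
  // cannot remove them.
  __global__ void testKernel(uint32_t size, uint32_t* out) {
    uint32_t sum = 0;
    for (uint32_t i = 0; i < size; i += 32) {
      uint32_t offset = i + (threadIdx.x & 0x1f);
      uint32_t mask = __ballot_sync(-1U, offset < size);
      if (offset < size) {
        uint32_t clk = clock();
        if (__any_sync(mask, (clk & 0x1) != 0)) {
          sum += __shfl_sync(mask, clk, threadIdx.x & 0x1f);
        }
      }
    }
    out[threadIdx.x] = sum;
  }

  int main() {
    uint32_t* d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(uint32_t));
    testKernel<<<1, 32>>>(1u << 20, d_out);  // size is a multiple of 32, so mask == -1U everywhere
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
  }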

Here is a comparison of the generated SASS (I think the PTX is similar):

  1. There’s a new instruction, R2UR, that creates the predicate P2, which is later used for a predicated branch instruction, BRA.DIV UR4 (which in the past wasn’t predicated).
  2. This predicate changed the execution flow. It seems that in the past, if all the threads of the warp were convergent at this point, this branch wasn’t taken even when the mask wasn’t -1U, as long as the OR of all the threads’ masks was -1U. Since CUDA 12.4, probably because of this predicate, the branch is taken whenever the mask isn’t -1U (even if the warp is convergent at this point), and it looks like this branch leads to expensive sync code.
  3. Even if the branch isn’t taken (i.e. mask == -1U), the program still executes more instructions, due to new match and vote instructions that weren’t there before. So unless the mask is known at compile time, this pattern will exhibit a slowdown.

So my questions are: was there a good reason for this change (e.g. a bug fix)? If so, what was it? And how can I avoid this slowdown (e.g. maybe the above pattern isn’t a good one)?
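
To make the last part of the question concrete: one restructuring I could imagine (untested, and I don’t know whether it actually removes the extra instructions) is to special-case the full-mask iterations so the intrinsics see a compile-time-constant mask on the common path:

  // Untested sketch of a possible restructuring of the loop body above.
  uint32_t mask = __ballot_sync(-1U, offset < size);
  if (offset < size) {
    uint32_t clk = clock();
    if (mask == -1U) {
      // Full warp active: the mask is a compile-time constant on this path.
      if (__any_sync(-1U, (clk & 0x1) != 0)) {
        sum += __shfl_sync(-1U, clk, threadIdx.x & 0x1f);
      }
    } else {
      // Partial warp (at most the last iteration): mask unknown at compile time.
      if (__any_sync(mask, (clk & 0x1) != 0)) {
        sum += __shfl_sync(mask, clk, threadIdx.x & 0x1f);
      }
    }
  }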

Thank you,
Natan

(1) Have you tried annotating the branches with [[likely]] and [[unlikely]] attributes (these are a C++20 feature)? My experiments show that this does influence how branches are compiled by Clang. I have not examined this methodically with the CUDA compiler yet, but both are based on the LLVM infrastructure. As long as the CUDA toolchain does not support profile-guided optimizations, use of these attributes is the best chance of providing branch probability information to the compiler, but compilers are free to ignore such hints, of course.
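
A minimal sketch of what such an annotation might look like, applied to the code from the original post (compiled with -std=c++20; I have not verified whether nvcc’s code generation actually changes):

  // Sketch only: C++20 branch-probability hint; the compiler is free to ignore it.
  if (offset < size) [[likely]] {
    uint32_t clk = clock();
    if (__any_sync(mask, (clk & 0x1) != 0)) {
      sum += __shfl_sync(mask, clk, threadIdx.x & 0x1f);
    }
  }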

(2) After staring at this code for some time, I am still not sure what the use of clock() is supposed to accomplish here. This usage strikes me as unusual, but maybe I missed a “common CUDA programming pattern”?

(3) Optimizing compilers employ a large collection of heuristics. Heuristics rarely improve all instances they are applied to. Usually they improve a large percentage of applicable cases, make no difference in some more cases, and actively hurt in a few instances.

In addition, heuristics (and the order in which they are applied) can lead to both constructive and destructive interactions, complicating their use. As a result, heuristics may be re-tuned every now and then. It is entirely possible that what you are observing is the result of such a heuristic re-tuning, which improved performance in a large percentage of cases but hurt your case. A performance bug in the compiler is likewise possible, but somewhat unlikely, as compilers are typically validated against a large corpus of code used in regression testing. As you already mention, it is also possible that the difference you observe is a fix for a previously existing functional compiler bug, a hardware issue, or a security concern such as a side-channel attack. If this is the result of a planned change, NVIDIA is unlikely to explain the rationale for changing internals of the compiler.

If this code generation difference leads to a significant difference in your application’s performance, you might want to file a bug report with NVIDIA.

Hi njuffa, thanks for the answer. Addressing your points:

(1) Yes, I’ve tried these attributes, and they don’t seem to make a difference (I’m also not sure how this is related).

(2) It’s just dummy code (the clock usage is only there to prevent the compiler from optimizing everything away; it could be replaced by a memory access at offset). The common pattern is the active-mask discovery, and then using that mask in warp-wide operations:

  uint32_t active_mask = __ballot_sync(FULL_WARP_MASK, cond);
  if (cond) {
    // some warp-wide operations that use the active threads
  }
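
A concrete, well-known instance of this pattern is warp-aggregated atomics; a rough sketch (illustrative only — counter and the per-thread slot usage are hypothetical, not my actual code):

  uint32_t active_mask = __ballot_sync(FULL_WARP_MASK, cond);
  if (cond) {
    uint32_t lane   = threadIdx.x & 0x1f;
    uint32_t leader = __ffs(active_mask) - 1;           // lowest active lane
    uint32_t base   = 0;
    if (lane == leader) {
      base = atomicAdd(counter, __popc(active_mask));   // one atomic per warp
    }
    base = __shfl_sync(active_mask, base, leader);      // broadcast from the leader
    uint32_t slot = base + __popc(active_mask & ((1u << lane) - 1));
    // ... each active thread now owns the unique index 'slot' ...
  }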

I understand why it’s unlikely for NVIDIA to expose compiler internals, but as a user I need a way to decide whether to upgrade the toolkit or stay with 12.3. It would help to know at least which of the categories you mentioned this change falls into.

I’ve already filed a bug report, thank you again.

Natan

Coming from an industry perspective I would claim that tool chains should be upgraded infrequently, and that these decisions can be made on a purely pragmatic basis. If a new version of a tool chain causes a performance regression in one’s bread & butter application, simply don’t update. It is not unusual for companies to have standardized on a tool chain that is several years old for their production development.

Even for personal use, unless some new feature is urgently needed, I see no need to operate on the bleeding edge. These days, I update my host system tool chain about every ten years, and for CUDA I went from 9.2 to 11.1 to 12.3 over the past several years. I am perfectly happy having other people ferret out the bugs in early releases of major versions; I’d rather live with the few bugs I am aware of in “old” software than have to constantly deal with a fresh set of new bugs yet to be discovered.

Hi @natan88a, I didn’t get a further reply from you in ticket 4668348. I tried putting the code into a runnable case and checking its runtime performance on an A100.
test.cu (1.1 KB)

We didn’t see an actual performance regression in CUDA 12.4 compared to CUDA 12.1.

12.4 -

==PROF== Disconnected from process 79138
[79138] out124@127.0.0.1
  Kernel(int, int *) (32, 1, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.21
    SM Frequency                    Mhz       224.46
    Elapsed Cycles                cycle        5,969
    Memory Throughput                 %         0.17
    DRAM Throughput                   %         0.01
    Duration                         us        26.59
    L1/TEX Cache Throughput           %         0.40
    L2 Cache Throughput               %         0.17
    SM Active Cycles              cycle     1,488.22
    Compute (SM) Throughput           %         0.81
    ----------------------- ----------- ------------

12.1 -

[77693] out121@127.0.0.1
  Kernel(int, int *) (32, 1, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         1.21
    SM Frequency                    Mhz       224.56
    Elapsed Cycles                cycle        7,323
    Memory Throughput                 %         0.12
    DRAM Throughput                   %         0.01
    Duration                         us        32.61
    L1/TEX Cache Throughput           %         0.32
    L2 Cache Throughput               %         0.12
    SM Active Cycles              cycle     1,889.73
    Compute (SM) Throughput           %         0.72
    ----------------------- ----------- ------------

If you have further questions, please comment back to us in the ticket.
