Since upgrading to CUDA 12.4 (and it seems to continue in CUDA 12.5), I’ve noticed a significant slowdown in warp-wide reduction operations (__ballot_sync, __any_sync, __shfl_sync, etc.) when compiling for the A100 architecture (-arch sm_80). This slowdown seems to be specific to the case where the mask of participating threads is not known at compile time or doesn’t include all the threads in the warp. For example, the following (dummy) code, which uses what I assume is a common CUDA programming pattern, will be slower when compiled with CUDA 12.4+:
for (uint32_t i = 0; i < size; i += 32) {
    uint32_t offset = i + (threadIdx.x & 0x1f);
    // Mask is known at compile time and it includes the whole warp, so no impact here
    uint32_t mask = __ballot_sync(-1U, offset < size);
    if (offset < size) {
        uint32_t clk = clock();
        // Here mask isn't known at compile time, and the SASS will be different
        if (__any_sync(mask, (clk & 0x1) != 0)) {
            sum += __shfl_sync(mask, clk, threadIdx.x & 0x1f);
        }
    }
}
Even if size % 32 == 0, i.e. mask == -1U for all iterations, the code will run slower and execute many more instructions.
Here is the comparison of the generated SASS (I think the PTX is similar):
There’s a new instruction, R2UR, that creates the predicate P2, which is later used by the predicated branch instruction BRA.DIV UR4 (which in the past wasn’t predicated).
This predicate changed the execution flow. It seems that in the past, if all the threads of the warp were convergent at this point, this branch wasn’t taken, even when an individual thread’s mask wasn’t -1U, as long as the OR of all the masks was -1U. Since CUDA 12.4, probably due to this predicate, the branch is taken whenever the mask isn’t -1U (even if the warp is convergent at this point), and it looks like this branch leads to expensive sync code.
Even if the branch isn’t taken (i.e. mask == -1U), the program still executes more instructions due to new match and vote instructions that weren’t there before. So unless the mask is known at compile time, this pattern will exhibit a slowdown.
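For illustration, one possible restructuring that keeps the mask a compile-time constant for the bulk of the loop is to peel the tail iteration. The sketch below is just a rewrite of the dummy code above (the same size/sum variables are assumed), not something I have verified to restore the old SASS:

// Sketch only: peel the tail so the full-warp iterations use the constant mask -1U.
uint32_t full_iters = size & ~31u;  // largest multiple of 32 not exceeding size
for (uint32_t i = 0; i < full_iters; i += 32) {
    // Every lane of the warp is in range here, so the mask is known at compile time.
    uint32_t clk = clock();
    if (__any_sync(-1U, (clk & 0x1) != 0)) {
        sum += __shfl_sync(-1U, clk, threadIdx.x & 0x1f);
    }
}
if (full_iters < size) {
    // Only the tail iteration needs the runtime-discovered mask.
    uint32_t offset = full_iters + (threadIdx.x & 0x1f);
    uint32_t mask = __ballot_sync(-1U, offset < size);
    if (offset < size) {
        uint32_t clk = clock();
        if (__any_sync(mask, (clk & 0x1) != 0)) {
            sum += __shfl_sync(mask, clk, threadIdx.x & 0x1f);
        }
    }
}

Whether this actually avoids the new predicated BRA.DIV path would have to be checked in the generated SASS.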
So my questions are: was there a good reason for this change (e.g. a bug fix)? If yes, what was the reason? And how can I avoid this slowdown (e.g. maybe the above pattern isn’t a good one)?
(1) Have you tried annotating the branches with [[likely]] and [[unlikely]] attributes (these are a C++20 feature)? My experiments show that this does influence how branches are compiled by Clang. I have not examined this methodically with the CUDA compiler yet, but both are based on the LLVM infrastructure. As long as the CUDA toolchain does not support profile-guided optimizations, use of these attributes is the best chance of providing branch probability information to the compiler, but compilers are free to ignore such hints, of course (see the sketch at the end of this reply).
(2) After staring at this code for some time, I am still not sure what the use of clock() is supposed to accomplish here. This usage strikes me as unusual, but maybe I missed a “common CUDA programming pattern”?
(3) Optimizing compilers employ a large collection of heuristics. Heuristics rarely improve all instances they are applied to. Usually they improve a large percentage of applicable cases, make no difference in some other cases, and actively hurt in a few instances.
In addition, heuristics (and the order in which they are applied) can lead to both constructive and destructive interactions, complicating their use. As a result, heuristics may be re-tuned every now and then. It is entirely possible that what you are observing is the result of such a heuristic re-tuning, which improved performance in a large percentage of cases but hurt your case. A performance bug in the compiler is likewise possible, but somewhat unlikely, as compilers are typically validated against a large corpus of code used in regression testing. As you already mention, it is also possible that the difference you observe is a fix for a previously existing functional compiler bug, a hardware issue, or a security concern such as a side-channel attack. If this is the result of a planned change, NVIDIA is unlikely to explain the rationale for changing internals of the compiler.
If this code generation difference leads to a significant difference in your application’s performance, you might want to file a bug report with NVIDIA.
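Regarding point (1), here is a minimal sketch of how such annotations could be attached to the divergent branch from the question (same variable names as in the original snippet; whether nvcc actually honors the hints would have to be verified in the SASS):

// Sketch only: C++20 [[likely]] hints on the branches from the original snippet.
// Compilers are free to ignore these attributes.
if (offset < size) [[likely]] {
    uint32_t clk = clock();
    if (__any_sync(mask, (clk & 0x1) != 0)) [[likely]] {
        sum += __shfl_sync(mask, clk, threadIdx.x & 0x1f);
    }
}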
Hi njuffa, thanks for the answer. Addressing your points:
(1) Yes, I’ve tried these attributes, and they don’t seem to make a difference (I’m also not sure how this is related).
(2) It’s just dummy code (the clock usage is only there so the compiler can’t optimize the body away; it could be replaced by a memory access at offset). The common pattern is the active-mask discovery, and then using the mask in warp-wide operations:
uint32_t active_mask = __ballot_sync(FULL_WARP_MASK, cond);
if (cond) {
    // some warp-wide operations that use the active threads
}
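As a more concrete sketch of this pattern (my own minimal example, not code from the real application), a per-warp sum over only the lanes where cond holds could look like the following; FULL_WARP_MASK is assumed to be 0xffffffff, and __reduce_add_sync is used because the target is sm_80:

// Sketch only: discover the active mask, then reduce over the active lanes.
constexpr uint32_t FULL_WARP_MASK = 0xffffffffu;

__device__ uint32_t sum_over_active_lanes(uint32_t value, bool cond) {
    // All 32 threads of the warp must reach this ballot.
    uint32_t active_mask = __ballot_sync(FULL_WARP_MASK, cond);
    uint32_t result = 0;
    if (cond) {
        // Warp-wide operation restricted to the active lanes.
        // __reduce_add_sync requires compute capability 8.0 or newer.
        result = __reduce_add_sync(active_mask, value);
    }
    return result;
}

Here active_mask is only known at run time, which is exactly the case where the CUDA 12.4 SASS differs.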
I understand why it’s unlikely for NVIDIA to expose internals, but as a user I need a way to decide whether to upgrade the toolkit or stay with 12.3. It would help to know at least which of the categories you mentioned this change falls into.
Coming from an industry perspective I would claim that tool chains should be upgraded infrequently, and that these decisions can be made on a purely pragmatic basis. If a new version of a tool chain causes a performance regression in one’s bread & butter application, simply don’t update. It is not unusual for companies to have standardized on a tool chain that is several years old for their production development.
Even for personal use, unless some new feature is urgently needed, I see no need to operate on the bleeding edge. These days, I update my host system tool chain about every ten years, and for CUDA I went from 9.2 to 11.1 to 12.3 over the past several years. I am perfectly happy having other people ferret out the bugs in early releases of major versions; I’d rather live with the few bugs I am aware of in “old” software than have to constantly deal with a fresh set of new bugs yet to be discovered.
Hi @natan88a, I didn’t get a further reply from you in ticket 4668348. I tried to put the code into a runnable case and check its runtime performance on an A100: test.cu (1.1 KB)
We didn’t see an actual performance regression in CUDA 12.4 compared to CUDA 12.1.
12.4 -
==PROF== Disconnected from process 79138
[79138] out124@127.0.0.1
Kernel(int, int *) (32, 1, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name             Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency                  Ghz         1.21
SM Frequency                    Mhz       224.46
Elapsed Cycles                cycle        5,969
Memory Throughput                 %         0.17
DRAM Throughput                   %         0.01
Duration                         us        26.59
L1/TEX Cache Throughput           %         0.40
L2 Cache Throughput               %         0.17
SM Active Cycles              cycle     1,488.22
Compute (SM) Throughput           %         0.81
----------------------- ----------- ------------
12.1 -
[77693] out121@127.0.0.1
Kernel(int, int *) (32, 1, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name             Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency                  Ghz         1.21
SM Frequency                    Mhz       224.56
Elapsed Cycles                cycle        7,323
Memory Throughput                 %         0.12
DRAM Throughput                   %         0.01
Duration                         us        32.61
L1/TEX Cache Throughput           %         0.32
L2 Cache Throughput               %         0.12
SM Active Cycles              cycle     1,889.73
Compute (SM) Throughput           %         0.72
----------------------- ----------- ------------
If you have further questions, please comment back to us in the ticket.