Unexecuted warp shuffle function makes my program slower

I'm having a performance issue with my kernel. The code looks something like this:

__device__ void runPath1() {
  // a lot of code
  int x = ...;
  x += __shfl_up_sync(0xffffffff, x, 1);
}

__device__ void runPath2() {
  // a lot of code
}

__global__ void myKernel() {
  const int k = ...; // get k
  bool b = k == 1;
  if (b) {
    runPath1();
  } else {
    runPath2();
  }
}

void test() {
  myKernel<<<gridSize, blockSize>>>();
}

In my program, the bool `b` is always false at runtime, so the device function runPath1() is never executed.

I would expect my code to have the same performance as:

if (b) {
} else {
  runPath2();
}

However, if I comment out the __shfl_up_sync() line in runPath1(), or the line where runPath1() is called, I get significantly better performance (execution time drops from about 0.042 s to 0.025 s).

With the critical __shfl_up_sync() line commented out, Nsight reports that memory throughput increases from 43% to 65% and SM throughput from 34% to 51%.

After some thought, my guess is that __shfl_up_sync() affects compilation: even though it is never executed at runtime, nvcc still sees it and generates different device code.
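One way to check this guess is to compare the generated SASS with and without the shuffle line. A sketch, assuming the kernel lives in a file called kernel.cu and targeting sm_80 (both are assumptions; adjust to your setup):

```shell
# Build just the device code and dump the SASS (cuobjdump ships with the CUDA toolkit).
nvcc -arch=sm_80 -cubin -o kernel.cubin kernel.cu
cuobjdump -sass kernel.cubin > with_shuffle.sass
# Comment out the __shfl_up_sync() line, rebuild, dump again, and diff the two files.
```

If the SASS for runPath2's code path differs between the two builds, that would confirm the shuffle changes code generation beyond its own branch.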

I tried to tell the compiler that `b` is uniform across the warp, but the program is still slow:

bool b = k == 1;
// now `b` is guaranteed uniform across the warp; I hoped this would
// give the compiler more information to optimize
b = __all_sync(0xffffffff, b);
if (b) {
  runPath1();
} else {
  runPath2();
}

Thanks for your attention.

A similar question: Thread divergence when block size is equal to warp size

What kind of state do runPath1 and runPath2 change? Global memory, shared memory, etc.? A shuffle also changes the state of other threads in the warp.

If there is a possibility of some state being changed, fewer optimizations are possible, including moving code between the inside and outside of the if clause.

Perhaps the state change can be decoupled.

Also consider using Nsight Compute to see exactly where the time is spent.

Thanks for your reply!

Both runPath1() and runPath2() write global memory.

It is likely that the shuffle builtin inhibits some optimizations, but I have no idea how to avoid that.

Maybe I have to write my code like this:

__global__ void myKernel1() {
  runPath1();
}
__global__ void myKernel2() {
  runPath2();
}

void test() {
  // check `b` in host function
  bool b = ...;
  if (b) {
    myKernel1<<<gridSize, blockSize>>>();
  } else {
    myKernel2<<<gridSize, blockSize>>>();
  }
}

which I would really rather not do. :(

I’m following your suggestions and still seeking a better solution.

If `b` can be determined (even indirectly) on the host side, you can make the kernel a template and pass the decision as a compile-time parameter. That way you don't need several separate kernel functions.
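A minimal sketch of that idea, reusing the names from the original pseudocode (runPath1, runPath2, gridSize, blockSize; the host-side computation of `b` is left elided as in the original). `if constexpr` requires C++17 (nvcc -std=c++17); a plain `if` on the template parameter would also be folded away at compile time:

```cuda
// Decide the path at compile time via a template parameter, so each
// instantiation contains only one code path in its device code.
template <bool UsePath1>
__global__ void myKernel() {
  if constexpr (UsePath1) {
    runPath1();
  } else {
    runPath2();
  }
}

void test() {
  bool b = ...; // determined on the host, as in the original sketch
  if (b) {
    myKernel<true><<<gridSize, blockSize>>>();
  } else {
    myKernel<false><<<gridSize, blockSize>>>();
  }
}
```

The host-side branch remains, but the instantiation `myKernel<false>` never sees the shuffle, so it cannot affect that kernel's code generation or register usage.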

The compiler typically tries to inline device functions. Resource usage, for example the number of registers per thread, is therefore affected by both code paths, so changes in runPath1 can impact the whole kernel.
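If inlining is the suspect, one experiment (an assumption to verify, not a confirmed fix) is to ask nvcc not to inline the shuffle-containing path, so its register pressure is less likely to spill into the rest of the kernel:

```cuda
// __noinline__ is a standard CUDA function qualifier that asks the
// compiler not to inline this device function. Whether this actually
// helps here should be checked with Nsight Compute.
__device__ __noinline__ void runPath1() {
  // a lot of code
  int x = ...;
  x += __shfl_up_sync(0xffffffff, x, 1);
}
```

Comparing the registers-per-thread count reported by `nvcc --ptxas-options=-v` with and without the qualifier would show whether the inlined shuffle path was driving up the kernel's resource usage.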

Your assumption is that the compiler assumes some threads in the warp take the other code path, and that there is a dependency on the shuffle instruction.

I doubt that is the reason. If it were, the optimizer would simply have missed some potential.

Use Nsight Compute first to get better information.