Unexecuted warp shuffle function makes my program slower

I'm having a performance issue with my kernel. The code looks something like this:

__device__ void runPath1() {
  // a lot of code
  int x = ...;
  x += __shfl_up_sync(0xffffffff, x, 1);
}

__device__ void runPath2() {
  // a lot of code
}

__global__ void myKernel() {
  const int k = ...; // get k
  bool b = k == 1;
  if (b) {
    runPath1();
  } else {
    runPath2();
  }
}

void test() {
  myKernel<<<gridSize, blockSize>>>();
}

In my program, the bool `b` is always false at runtime, so the device function runPath1() is never executed.

I would expect my code to have the same performance as:

if (b) {
} else {
  runPath2();
}

However, if I comment out the __shfl_up_sync() line in runPath1(), or the line where runPath1() is called, I get significantly better performance (execution time drops from about 0.042 s to 0.025 s).

With the critical __shfl_up_sync() line commented out, Nsight reports that memory throughput increases from 43% to 65% and SM throughput from 34% to 51%.

After some thought, my guess is that __shfl_up_sync() affects compilation: even though it is never executed at runtime, nvcc still sees it and generates different device code.
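One way to check this guess is to compare the generated SASS with and without the shuffle line. A sketch, assuming the kernel lives in a file called kernel.cu and targeting sm_80 (both are assumptions; adjust to your setup):

```shell
# Build just the device code and dump the SASS (cuobjdump ships with the CUDA toolkit).
nvcc -arch=sm_80 -cubin -o kernel.cubin kernel.cu
cuobjdump -sass kernel.cubin > with_shuffle.sass
# Comment out the __shfl_up_sync() line, rebuild, dump again, and diff the two files.
```

If the SASS for runPath2's code path differs between the two builds, that would confirm the shuffle changes code generation beyond its own branch.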

I tried to tell the compiler that `b` is uniform across the warp, but the program is still slow:

bool b = k == 1;
// now `b` is guaranteed uniform across the warp; I hoped this would
// give the compiler more information to optimize
b = __all_sync(0xffffffff, b);
if (b) {
  runPath1();
} else {
  runPath2();
}

Thanks for your attention.

A similar question: Thread divergence when block size is equal to warp size

What kind of state do runPath1 and runPath2 change? Global memory, shared memory, etc.? A shuffle also changes the state of other threads in the warp.

If there is a possibility of some state being changed, fewer optimizations are possible, including moving code between the inside and outside of the if clause.

Perhaps the state change can be decoupled.

Also consider using Nsight Compute to see exactly where the time is spent.

Thanks for your reply!

Both runPath1() and runPath2() write global memory.

It is likely that the shuffle builtin inhibits some optimizations, but I have no idea how to avoid that.

Maybe I have to write my code like this:

__global__ void myKernel1() {
  runPath1();
}
__global__ void myKernel2() {
  runPath2();
}

void test() {
  // check `b` in host function
  bool b = ...;
  if (b) {
    myKernel1<<<gridSize, blockSize>>>();
  } else {
    myKernel2<<<gridSize, blockSize>>>();
  }
}

which I would really rather not do. :(

I’m following your suggestions and still seeking a better solution.

If `b` can be determined (even indirectly) on the host side, you can make the kernel a template and pass the decision as a compile-time parameter. That way you don't need several separate kernel functions.
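A minimal sketch of that idea, reusing the names from the original pseudocode (runPath1, runPath2, gridSize, blockSize; the host-side computation of `b` is left elided as in the original). `if constexpr` requires C++17 (nvcc -std=c++17); a plain `if` on the template parameter would also be folded away at compile time:

```cuda
// Decide the path at compile time via a template parameter, so each
// instantiation contains only one code path in its device code.
template <bool UsePath1>
__global__ void myKernel() {
  if constexpr (UsePath1) {
    runPath1();
  } else {
    runPath2();
  }
}

void test() {
  bool b = ...; // determined on the host, as in the original sketch
  if (b) {
    myKernel<true><<<gridSize, blockSize>>>();
  } else {
    myKernel<false><<<gridSize, blockSize>>>();
  }
}
```

The host-side branch remains, but the instantiation `myKernel<false>` never sees the shuffle, so it cannot affect that kernel's code generation or register usage.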

The compiler typically tries to inline device functions. Resource usage, for example the number of registers per thread, is therefore affected by both code paths, so changes in runPath1 can impact the whole kernel.
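If inlining is the suspect, one experiment (an assumption to verify, not a confirmed fix) is to ask nvcc not to inline the shuffle-containing path, so its register pressure is less likely to spill into the rest of the kernel:

```cuda
// __noinline__ is a standard CUDA function qualifier that asks the
// compiler not to inline this device function. Whether this actually
// helps here should be checked with Nsight Compute.
__device__ __noinline__ void runPath1() {
  // a lot of code
  int x = ...;
  x += __shfl_up_sync(0xffffffff, x, 1);
}
```

Comparing the registers-per-thread count reported by `nvcc --ptxas-options=-v` with and without the qualifier would show whether the inlined shuffle path was driving up the kernel's resource usage.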

Your assumption is that the compiler assumes some threads in the warp take the other code path, and that there is a dependency on the shuffle instruction.

I doubt that is the reason. If it were, the optimizer would simply have missed some potential.

Use Nsight Compute first to get better information.