I have a performance issue with my kernel. The code looks something like this:
__device__ void runPath1() {
    // a lot of code
    int x = ...;
    x += __shfl_up_sync(0xffffffff, x, 1);
}

__device__ void runPath2() {
    // a lot of code
}

__global__ void myKernel() {
    const int k = ...; // get k
    bool b = k == 1;
    if (b) {
        runPath1();
    } else {
        runPath2();
    }
}

void test() {
    myKernel<<<gridSize, blockSize>>>();
}
In my program, `b` is always false at runtime, so the device function runPath1() is never executed.
I would therefore expect my code to perform the same as:

if (b) {
} else {
    runPath2();
}
However, if I comment out either the __shfl_up_sync() line in runPath1() or the line that calls runPath1(), I get significantly better performance: execution time drops from about 0.042 s to 0.025 s.
When I comment out the critical __shfl_up_sync() line, Nsight Compute shows memory throughput increasing from 43% to 65% and SM throughput from 34% to 51%.
After some thought, my guess is that the mere presence of __shfl_up_sync() affects the nvcc compiler: even though the call is never executed at runtime, the compiler still sees it and generates different device code for the whole kernel.
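One way I could test this guess is to make the branch a compile-time decision, so the unused path is removed by dead-code elimination instead of merely being branched around at runtime. A minimal sketch (assuming `k` is known on the host before launch, and compiling with `-std=c++17` for `if constexpr`):

```cuda
// Sketch: template the kernel on the path selection so nvcc never
// emits runPath1()'s code (including __shfl_up_sync) into the
// instantiation that doesn't need it.
template <bool UsePath1>
__global__ void myKernelT() {
    if constexpr (UsePath1) {
        runPath1();
    } else {
        runPath2();
    }
}

void test(int k) {
    // Pick the instantiation on the host; each compiled kernel
    // contains only one path.
    if (k == 1) {
        myKernelT<true><<<gridSize, blockSize>>>();
    } else {
        myKernelT<false><<<gridSize, blockSize>>>();
    }
}
```

If the `<false>` instantiation runs as fast as the version with __shfl_up_sync() commented out, that would confirm the slowdown comes from compiling the unused path, not from executing it.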
I tried to tell the compiler that `b` is uniform across the warp, but my program is still slow:

bool b = k == 1;
// now `b` is provably the same for every thread in the warp;
// I hoped this extra information would let the compiler optimize
b = __all_sync(0xffffffff, b);
if (b) {
    runPath1();
} else {
    runPath2();
}
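Another idea I have not fully explored: since `b` is always false in my program, I could assert that to the compiler directly with `__builtin_assume` (documented in the CUDA C++ Programming Guide; this sketch assumes a recent CUDA toolkit where it is supported in device code):

```cuda
// Sketch: if the launch guarantees b is false, telling the compiler
// so may allow it to drop the runPath1() branch entirely.
// WARNING: behavior is undefined if the assumption is ever violated.
__global__ void myKernel() {
    const int k = ...; // get k
    bool b = k == 1;
    __builtin_assume(!b); // assumption: this launch never takes path 1
    if (b) {
        runPath1();
    } else {
        runPath2();
    }
}
```

I am not sure whether nvcc actually uses this hint to eliminate the dead branch, which is part of what I am asking.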
Thanks for your attention.
A similar question: Thread divergence when block size is equal to warp size