I’m testing CUDA in an application that implements a sieve of Eratosthenes to generate prime numbers. The program works correctly.
However, something strange happens: when I add a dummy “if statement” whose condition always evaluates to false for all threads, I get an unexpected speedup.
The relevant part of the code is attached below. The dummy if statement is marked in bold. It always evaluates to false, because the kernel works just fine and does not hang in the while(true) loop.
On my 8400M GS laptop I get about 5.5 seconds without the “if statement” and about 5 seconds with it. That’s a half-second speedup.
Is there any reason why a useless piece of code would improve total performance? Does it have to do with the optimizer?
Check whether the number of registers changes between the two versions. The compiler cannot schedule instructions across the loop in the first case, so it may use fewer registers. This can then lead to higher occupancy and a speed-up.
If I create the cubin files for both versions, both appear to use the same number of registers (15). Only the bincode section seems to differ, as shown below:
Nope, same number. The PTX doesn’t include the final register allocation. You may have to look through the changes in the PTX code to see if anything fishy happens.