My iterate kernel below makes heavy use of math functions, so this part accounts for about 95% of the GPU time during a given iteration loop. Average throughput is 40 MIS (mega-iterations per second).
I noticed that the simple uint64_t compare of result appears to eat up a factor of 37 (!) of my GPU time: if I comment out the entire if() statement (not just the code inside the block), MIS goes up to 1500.
How is that possible?!
I’m under the impression that I’m missing something really stupid, but I don’t get it.
I’m not even sure any of this is CUDA related, but I have never noticed anything similar in main().
#include <cstdint>

__global__ void iterate_kernel()
{
    const uint64_t block = 0x0123456789ABCDEFULL;
    uint64_t result;

    // lots of math which takes time and computes result

    if (result == block) {
        // this code marks a hit, which is virtually never executed
    }
}
Note 1: the code inside the if() block is virtually never executed, so it cannot be what is slowing things down.
Note 2: cuda-memcheck doesn’t report errors.
It’s possible because you are working with an optimizing compiler. The compiler will remove any code that has no effect on state that is observable after the kernel has finished executing. The code you removed probably contains writes of results to global memory, or something like that, and removing it means the compiler can also remove all the code that creates those results. (Why bother computing results if they will never be visible in any way?) It’s quite possible you are ending up with an empty kernel.
This is a common question. Debugging, profiling, or performance analysis of code by “commenting out” pieces of it is fraught with this peril.
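To illustrate with a minimal sketch (hypothetical kernels, not your code): in the first kernel below, nothing observable depends on the math, so the compiler may legally compile it down to an empty kernel; the single global memory store in the second keeps the whole computation alive.

// kernel_dead: the result of the loop is never stored anywhere
// observable, so the compiler can eliminate all of the math
__global__ void kernel_dead()
{
    double x = 1.0;
    for (int i = 0; i < 10000; ++i)
        x = sin(x) + cos(x);   // expensive, but dead code
}

// kernel_live: identical math, but the store to global memory makes
// the result observable, so the math must actually be executed
__global__ void kernel_live(double *out)
{
    double x = 1.0;
    for (int i = 0; i < 10000; ++i)
        x = sin(x) + cos(x);
    out[threadIdx.x] = x;      // observable side effect
}

Timing these two kernels should show the same kind of dramatic gap you are seeing.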
I understand what you mean, but the effect still occurs even if I empty the block inside the if() statement. It really looks (?!) as if the simple uint64_t comparison itself is the culprit.
if (result == block) {
    // no code
}
Also … if this were a minor performance loss, I’d get it, but a factor of 37 is a bit much, I believe. Are you sure there isn’t anything else going on?
Your name appears all over SO and Google when looking for answers to CUDA questions. So believe me, I believe you … :) It’s just that I have to re-wire some things in my brain to cope with this compiler behavior.
At the same time, I wonder how I could work around this “brake”?
Dead code elimination is not new to the CUDA compiler, or indeed to any optimizing compiler going back to (at least) the 1980s.
As for the comparison itself: while the GPU is essentially a 32-bit processor with 64-bit addressing capability, which means that 64-bit integer arithmetic is generally emulated, a uint64_t comparison should take just two instructions where a uint32_t comparison takes one. By itself it could never account for a factor of 37.
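If you want to keep the computation from being eliminated, don’t fight the compiler; make the result observable. A minimal sketch, assuming a hypothetical hit counter d_hit in global memory (not your actual code):

#include <cstdint>

__global__ void iterate_kernel(unsigned int *d_hit)
{
    const uint64_t block = 0x0123456789ABCDEFULL;
    uint64_t result = 0;

    // lots of math which takes time and computes result

    if (result == block) {
        atomicAdd(d_hit, 1u);   // observable side effect: because this
                                // store can happen, the compiler must
                                // keep all the math feeding result
    }
}

The branch body is still virtually never executed, but since it could write to global memory, the comparison and everything upstream of result stays live. You can verify what the compiler actually kept by disassembling the generated machine code with cuobjdump -sass.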
I checked the code again and again, and you were right!!! result is not used by any code below the if() statement, hence dead code, as you said. So it had nothing to do with the uint64_t comparison. Many thanks for the kick.
Note: I didn’t post the entire code because it is too lengthy; otherwise it would have shown that result isn’t used downstream.