Different runs using same parameters produce different results


I’m a newbie with CUDA and have hit an unexpected problem. My code does lots of iterations, bit shifts, XOR operations, … and all of them are cumulative. Everything operates on uint64_t variables, which are all initialized before the start. Only one thread is launched (for testing), hence no race conditions. I’m using constant memory. The hardware is an RTX 2080 SUPER.

Now … same starting point, different results, but only after the first couple of loops.

Never (!) seen such a thing.

How is that possible?

Run 1

 322117397           1578853418
1578853418            763126208
 763126208           3811226253 <<<
3811226253            142650721
 142650721           3268674706

Run 2

 322117397           1578853418
1578853418            763126208
 763126208           3778720397 <<<
3778720397           3368598641
3368598641           3282275484

How is that possible?

In order of decreasing likelihood, I’d say:

  1. bug(s) in code; may be design issues or coding issues.
  2. compiler code generation issue
  3. hardware issue

The likelihood of item (1) is typically in the 95+% range. Depending on what kind of GPU you have, items (2) and (3) may apply in reverse. Beware in particular of heavily (vendor-)overclocked GPUs; these may have been insufficiently qualified for proper operation when running compute apps.

When you run the app under the control of cuda-memcheck, does it report any issues for the failing runs?

Thanks for the answer.

cuda-memcheck helped quite a lot! I managed to reduce the problem to the printf function itself, which produced the dump above. If I comment it out, there are no memory errors anymore! However, I’d like to understand what happens here.

Basically, inside the kernel I’m using …

uint64_t data;
printf("%lu \n", data);

… which should be fully legal code.

On the host, this doesn’t produce any issues, so why does it inside the kernel?

=========     Address 0x00fffcf0 is out of bounds

Commenting out the printf may allow the compiler to dispense with other code in your kernel as well (once a value is no longer printed or otherwise used, the code that computes it is dead and can be eliminated), which means your assumptions about what is happening, and where the problem lies, may be incorrect. Commenting out code can be quite a confusing strategy for either performance work or debugging when using an aggressively optimizing compiler. Use the method described here:

with cuda-memcheck to get the actual line of source code that is generating the out-of-bounds access.

Indeed, after reading your SO post, I discovered 2 more bugs, all related to bad indexing with subsequent illegal global memory reads.

Many thanks!
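
For anyone landing here with the same symptom: the kind of bug described in this thread (a bad index followed by an illegal global memory read) can be guarded against with a bounds-checked read. The following is only a minimal sketch, not the original poster's code. The names `checked_read`, `mix`, `iterate` and the file name `app.cu` are hypothetical, and the shift/xor step merely stands in for the real iteration. The `HD` macro lets the helpers compile both under nvcc and as plain C++, so they can be unit-tested on the host.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Build with source line info so the memory checker can name the faulting
// line (app.cu is an assumed file name):
//   nvcc -lineinfo -o app app.cu
//   cuda-memcheck ./app
// Newer CUDA toolkits ship compute-sanitizer in place of cuda-memcheck.

// Compiles as CUDA with nvcc and as plain C++ for host-side testing.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: reads table[idx] only when idx is in range.
// Returns true and stores the value in *out on success; returns false
// and leaves *out untouched when idx is out of bounds.
HD inline bool checked_read(const uint64_t* table, size_t n, size_t idx,
                            uint64_t* out) {
    if (idx >= n) {
        return false;
    }
    *out = table[idx];
    return true;
}

// Hypothetical cumulative step (xor and shifts), standing in for the
// real per-iteration computation.
HD inline uint64_t mix(uint64_t state, uint64_t key) {
    state ^= key;
    state ^= state << 13;
    state ^= state >> 7;
    return state;
}

#ifdef __CUDACC__
// Single-thread test kernel. A bad index is reported and the loop stops,
// instead of silently folding whatever lies past the table into the state.
__global__ void iterate(const uint64_t* table, size_t n, size_t start,
                        int rounds, uint64_t* result) {
    uint64_t state = 1;  // arbitrary seed, for illustration only
    for (int i = 0; i < rounds; ++i) {
        uint64_t key = 0;
        if (!checked_read(table, n, start + i, &key)) {
            printf("index %llu out of range (n = %llu)\n",
                   static_cast<unsigned long long>(start + i),
                   static_cast<unsigned long long>(n));
            break;
        }
        state = mix(state, key);
    }
    *result = state;
}
#endif
```

An unchecked out-of-bounds read returns whatever happens to be stored at that address, and because every step is cumulative, one stray value is enough to make all later outputs differ from run to run. The check costs one comparison per read and can be removed once the indexing is known to be correct.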