I’m a newbie with CUDA and have hit an unexpected problem. My code does lots of iterations, bit shifts, XOR operations, and so on, all cumulative. Everything happens with uint64_t variables, which are all initialized before the start. Only one thread is launched (for testing), so there are no race conditions. I’m using constant memory. Hardware is an RTX 2080 SUPER.
Now, starting from the same initial state, I get different results, but only after the first couple of loops.
There are three broad candidates:

1. bug(s) in the code; these may be design issues or coding issues
2. a compiler code generation issue
3. a hardware issue
The likelihood of item (1) is typically in the 95+% range. Depending on what kind of GPU you have, the relative likelihood of items (2) and (3) may be reversed. Beware in particular of heavily (vendor-)overclocked GPUs; these may have been insufficiently qualified for proper operation with compute apps.
When you run the app under the control of cuda-memcheck, does it report any issues for the failing runs?
cuda-memcheck helped quite a lot! I managed to reduce the problem to the printf call itself, which produced the dump shown below. If I comment it out, there are no memory errors anymore! However, I’d like to understand what happens here.
Basically, inside the kernel I’m using …
uint64_t data;
...
printf("%lu \n", data);
… which should be fully legal code.
On the host, this doesn’t produce any issues, so why does it inside the kernel?
...
========= Address 0x00fffcf0 is out of bounds
...
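For reference, here is a minimal self-contained sketch of printing a uint64_t from a kernel; the kernel name and value are made up for illustration. Device printf follows the host compiler’s format conventions, so %lu only matches uint64_t on platforms where long is 64 bits, and the PRIu64 macro from cinttypes (or %llu) is the portable choice. Whether the format specifier is actually related to the out-of-bounds report above is not established in this thread.

#include <cstdio>
#include <cstdint>
#include <cinttypes>

__global__ void print_u64(uint64_t data)
{
    // PRIu64 expands to the format specifier that matches uint64_t on the
    // host platform, which is what device printf expects as well.
    printf("%" PRIu64 "\n", data);
}

int main()
{
    print_u64<<<1, 1>>>(0x0123456789abcdefULL);
    cudaDeviceSynchronize();   // flush the device-side printf buffer
    return 0;
}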
Commenting out the printf may allow the compiler to dispense with other code in your kernel as well, which means your assumptions about what is happening and where the problem lies may be incorrect. Commenting out code can be quite a confusing strategy for either performance work or debugging when using an aggressively optimizing compiler. Use the method described here:
with cuda-memcheck to get the actual line of source code that is generating the out-of-bounds access.
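For completeness, the usual way to get that source-line attribution (assuming a standard nvcc toolchain; the file name is a placeholder) is to compile with line information and rerun under cuda-memcheck:

nvcc -lineinfo -o testapp testapp.cu
cuda-memcheck ./testapp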