Different runs using the same parameters produce different results

Hi.

I’m a newbie with CUDA and have hit an unexpected problem. My code does lots of iterations, bit shifts, XOR operations, and so on, all cumulative. Everything happens with uint64_t variables, which are all initialized before the start. Only one thread is launched (for testing), hence no race conditions. I’m using constant memory. The hardware is an RTX 2080 SUPER.
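Roughly, the structure looks like this (a stripped-down sketch with made-up names, not my actual code; the per-iteration print that produces the dumps below is left out):

__constant__ uint64_t c_param[8];        // parameters live in constant memory

__global__ void iterate(uint64_t *out, int steps)
{
    uint64_t x = c_param[0];
    for (int i = 0; i < steps; ++i) {
        x ^= x << 13;                    // cumulative shifts and xors,
        x ^= x >> 7;                     // every update feeding the next
        x ^= c_param[i & 7];
    }
    *out = x;
}

// launched with a single thread for testing, e.g. iterate<<<1, 1>>>(d_out, 1000);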

Now: same starting point, different results, but only after the first couple of loops.

I have never (!) seen such a thing.

How is that possible?

Run 1:

 322117397           1578853418
1578853418            763126208
 763126208           3811226253 <<<
3811226253            142650721
 142650721           3268674706
...

Run 2:

 322117397           1578853418
1578853418            763126208
 763126208           3778720397 <<<
3778720397           3368598641
3368598641           3282275484
...

How is that possible?

In order of decreasing likelihood, I’d say:

  1. bug(s) in code; may be design issues or coding issues.
  2. compiler code generation issue
  3. hardware issue

The likelihood of item (1) is typically in the 95+% range. Depending on what kind of GPU you have, items (2) and (3) may apply in reverse order. Beware in particular of heavily (vendor-)overclocked GPUs; these may have been insufficiently qualified for proper operation when running compute apps.

When you run the app under the control of cuda-memcheck, does it report any issues for the failing runs?


Thanks for the answer.

cuda-memcheck helped quite a lot! I managed to reduce the problem to the printf call itself, which produced the dump above. If I comment it out, there are no memory errors anymore! However, I’d like to understand what is happening here.

Basically, inside the kernel I’m using …

uint64_t data;
...
printf("%lu \n", data);

… which should be fully legal code.

On the host, this doesn’t produce any issues, so why does it inside the kernel?

...
=========     Address 0x00fffcf0 is out of bounds
...

Commenting out the printf may allow the compiler to dispense with other code in your kernel as well, which means your assumptions about what is happening and where the problem lies may be incorrect. Commenting out code can be quite a confusing strategy for either performance analysis or debugging when using an aggressively optimizing compiler. Use the method described here:

with cuda-memcheck to get the actual line of source code that is generating the out-of-bounds access.
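As an aside, and likely unrelated to the out-of-bounds report: the final formatting of device-side printf output happens on the host, so %lu only matches a uint64_t argument on platforms where long is 64 bits (it is not on Windows). If you want that line to be portable, something along these lines should work (a sketch):

#include <cstdio>
#include <cstdint>
#include <cinttypes>                     // for PRIu64

__global__ void dump_value(uint64_t data)
{
    // cast to a type whose specifier is the same everywhere ...
    printf("%llu \n", (unsigned long long)data);
    // ... or let the host's <cinttypes> supply the matching specifier
    printf("%" PRIu64 " \n", data);
}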


Indeed, after reading your SO post, I discovered two more bugs, all related to bad indexing and the resulting illegal global memory reads.
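For anyone who lands on this thread later, the bugs were of the garden-variety out-of-range-index kind, schematically like this (hypothetical names, not my exact code):

#include <cstdint>

__global__ void lookup(const uint64_t *g_table, size_t table_len,
                       const size_t *indices, uint64_t *out, int n)
{
    for (int i = 0; i < n; ++i) {
        size_t idx = indices[i];
        // the missing check: without it, a bad idx becomes an
        // illegal global read that cuda-memcheck flags
        out[i] = (idx < table_len) ? g_table[idx] : 0;
    }
}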

Many thanks!