Bug in nvcc (3.2 final release) on GTX 460? Memory content inconsistent


I think I found a bug in nvcc:

This is how to reproduce it:

tar xzf buggy.tar.gz
cd Buggy
make; ./buggy
8D B6 A8 7D 12 2D 6F DA CB CB 0B CB CB CB 0B CB
5D 5D 69 5D 5D 5D 69 5D 6A 6A B5 6A 6A 6A B5 6A
72 72 D5 72 72 72 D5 72 D0 D0 67 D0 D0 D0 67 D0
47 47 01 47 47 47 01 47 15 15 54 15 15 15 54 15

That’s the correct output.

make clean; make BUG=1; ./buggy
00 00 00 00 12 2D 6F DA CB CB 0B CB CB CB 0B CB
00 00 00 00 5D 5D 69 5D 6A 6A B5 6A 6A 6A B5 6A
00 00 00 00 72 72 D5 72 D0 D0 67 D0 D0 D0 67 D0
00 00 00 00 47 47 01 47 00 00 00 00 15 15 54 15
^^^^^^^^^^^ This part is clearly wrong.

The only difference between the two versions is in whirl.cu, line 673:
With BUG defined, an integer array is copied using a for loop; without BUG defined, it is copied by memcpy.
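To illustrate: the actual whirl.cu source is not quoted in this thread, so the following is only a sketch of what the two variants might look like. The names copy_state, out, state, and LENGTH are placeholders, and I am assuming make BUG=1 simply passes -DBUG to nvcc; in the real code this would be __device__ code, but the copy semantics shown here are plain C/C++ either way.

```cpp
#include <cstring>   // memcpy
#include <cstdint>

// Placeholder for the array length used in whirl.cu.
constexpr int LENGTH = 16;

// Variant selected with BUG defined (assumed "make BUG=1" -> -DBUG):
// element-wise copy loop, which triggers the miscompilation on sm_2x.
void copy_state_loop(uint32_t *out, const uint32_t *state)
{
    for (int i = 0; i < LENGTH; i++)
        out[i] = state[i];
}

// Variant without BUG defined: block copy via memcpy(),
// which the CUDA 3.2 compiler translates correctly.
void copy_state_memcpy(uint32_t *out, const uint32_t *state)
{
    memcpy(out, state, LENGTH * sizeof(uint32_t));
}
```

Both variants are semantically identical; only the generated machine code differs, which is what makes this look like a compiler bug rather than a source-level error.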

The bug disappears when the loop in line 650 is unrolled, i.e. when instead of
for (r = 0; r < 2; r++) {
    Transform2(K, 0);
    Transform1(S, K);
}

this code is used:

Transform2(K, 0);
Transform1(S, K);
Transform2(K, 0);
Transform1(S, K);

Is this a compiler bug or a bug in hardware?
(I reproduced the bug with another GTX 460 card, too.)

I could not reproduce it on Tesla C1060.

cuda-memcheck does not complain.

Please help me!

Kind regards
nvidia-bug-report.log.gz (46.4 KB)
buggy.tar.gz (4.58 KB)

Thank you for bringing this issue to our attention, and for providing a standalone repro case. I am able to reproduce the issue on an RHEL 64-bit Linux system running a C2050, compiling with the compiler from CUDA 3.2 (final release). I will follow up with the compiler team.

Using the compiler flag -Xptxas -O0 made the issue disappear in my experiments and is thus a possible workaround. The workaround you identified, i.e. using memcpy(), very likely provides better performance.
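For reference, the workaround is applied on the nvcc command line; in a Makefile-driven build like the attached one, something along these lines should work (the file and target names here are assumptions, not taken from buggy.tar.gz):

```
# Pass -O0 through to ptxas to disable its optimizations
# (workaround for the miscompiled copy loop).
nvcc -arch=sm_21 -Xptxas -O0 -o buggy whirl.cu
```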

Thanks for helping out with this issue!

Unfortunately the memcpy trick is not a real workaround for me, because the provided repro case is a heavily stripped-down version of a much larger application. I stripped it down just to the point where it still shows the error. The calculations performed have almost nothing in common with the original code.

I assume the bug occurs more than once in my original code, because using memcpy() in this and comparable code parts fixes only one of the problems. The rest of the code produces incorrect results, too.

Could you reproduce the problem on any GPU with compute capability < 2.0?

On my Tesla C1060 station it seems to work fine (even with SDK 3.2).

I will try on my GTX 285, soon.

Thank you very much for your help. This is production code which has been running just fine up to now (on sm_13), and I am eagerly expecting my larger Fermi-based machines to work, too.



I did not try to reproduce the problem for platforms other than Fermi/Linux, which I understood to be the platform you use. As I said, the only other workaround I could find is to turn off optimizations in ptxas by passing -Xptxas -O0 to nvcc. Applied to your entire code it would likely result in a serious drop in performance, which is why I thought the memcpy() should work better.

As an experiment, it might be worthwhile to try -Xptxas -O0 to see whether it makes the other issues go away, too. If you have repro cases for any other issues, feel free to send me a PM and I’ll look into it (starting January 3rd). If you are a registered developer it would be best to file compiler bugs directly (attaching any repro cases), as this will allow you to track the progress of the bugs you file.

Did you try sm_11 or sm_12 on Fermi? Interesting that CUDA 3.2 does not show the bug on older machines. I suspect it may be a bug in PTX-to-machine-code generation. What about the CUDA 3.0 or 3.1 SDKs? They also had Fermi support; I recently rolled back to them.

I tried with -Xptxas -O0 and my program runs correctly but very slow.

I am a registered developer now. Should I resubmit the bug report, or can I track the progress of this bug some other way?

Thanks and a happy new year,


PS: Sorry for the delay, but my children infected me with a virus (yes, there are some in the real world, too!)

Hi Lev,

I tried it here with my complete code (i.e. not the stripped-down bug reproduction version presented above):

Results on my 32-bit machine with a GTX 460:

compiled with sm_20 or sm_21: the program works correctly (at least it seems to, on some input cases, today).

compiled with sm_13 or lower: results are incorrect.

Very strange.

On a 64-bit machine with many GTX 460 cards:

compiled with sm_20 or sm_21: results are incorrect

compiled with sm_13 or lower: results are incorrect

Even stranger.

All results with CUDA 3.2 (nvcc reports release 3.2, built on November 3rd).

The only architecture-dependent flag passed to nvcc was -arch=sm_21.



I filed a compiler bug on the memcpy vs copy loop issue in December, so there is no need to file a second bug for that. Registered developers can file bugs and have visibility into the status of bugs they filed. Bugs filed from within NVIDIA are visible only inside NVIDIA. If you have repro cases for other issues besides the original one we discussed in December, I would encourage you to file additional bugs.

From further experiments it seems that inserting a __syncthreads() prior to the copy loop fixes the issue in the repro code, i.e.

whirlpool_compress(&ctx, key_salt); // first WHIRLPOOL_BLOCK_SIZE bytes

__syncthreads();  // <<<<<<<<<<<<

for (int i = 0; i < 16; i++) {
    out[i] = ctx.s.state[i];
}

Maybe this could be a candidate for a workaround in the full app that does not completely destroy performance?


Is there any news on this bug yet?

Has the bug been confirmed by the compiler group, or did I make some silly mistake?

How can I avoid it? (Downgrading to 3.1 or even further back, downgrading the Linux driver, or anything else?)

Will I be able to test the fix before it is released (where and when will this happen)?

Thanks for your help

The copy-loop issue is due to a real compiler bug. I know of no workarounds other than the ones I have already suggested. Compiling with -Xptxas -O0 is the only workaround that likely works beyond the specific repro app supplied. As you stated, with that switch your code works correctly, but slowly. I do not know whether downgrading to the 3.1 toolchain avoids the compiler issue. Sorry for the inconvenience, and thanks again for reporting this issue and helping us track it down quickly with the supplied repro app.

Hello njuffa,

I did not find any way to send you a PM via the forum software (it seems you are missing from the members list).

Could you please contact me by PM regarding my second issue (my other problem on a Fermi machine)?

I assume it is a different issue (that problem seems not to be present in SDK 3.1).

My suspicion is a driver bug triggered by a compiler optimization that is not present in SDK 3.1.




I don’t have any ideas why there were problems in sending me a PM. I have received multiple PMs from forum participants over the past 24 hours, and sent some PMs myself, so in general the forum’s messaging mechanism appears to work. Unfortunately, now it is I who has a problem, in that I am unable to follow the link you provided. Could you please let me know in which subforum and under which subject you posted your other issue. Thanks!


Sorry, I edited the link (it had a doubled http://).



Thanks, the link works now and I have read the thread. Sorry, that issue is way outside my area of expertise.