compiler bugs, caveats

While developing several projects with CUDA I encountered many problems: some compiler bugs, some different-than-C-compiler behavior, some completely untraceable bugs.
I had a hard time figuring them out, as I sometimes didn’t know where and what to search for. It was sometimes a real pain and took a lot of time.
In this thread I would like the CUDA community to post every_bug_out_there. Every “wtf”, every odd behaviour, etc.
Let it be documented in one place.
Please include code examples that cause the bug, what the code should do but doesn’t (or what it does but shouldn’t), and how to fix it or work around it…
Or post a thread where you posted your bug before.
Let’s help new devs ;)

To get something started:
[topic=“166687”]unexpected loop unrolling[/topic]
[topic=“166681”]compiler problems with 64bit datatype and logical instructions[/topic]

Uh… in my one year of experience with CUDA I have stumbled on many, many strange errors. Some are funny, some gave me a headache for several days or even weeks, completely stopping my progress…
Here is my list (from newest to oldest).
Unfortunately, in most cases I have no idea how far along the fix is: is anyone working on it, or was it completely ignored? … I am worried the latter is the case.
Anyway, have fun reading!

unresolved bug: compiler crash — Run out of registers in predicate (CUDA 3.0)

bug + partial workaround: compile failure, __device__ cannot be called from __host__ __device__ despite __CUDA_ARCH__ guarding (CUDA 3.0)

bug + workaround: Compiler stack overflow. (CUDA 2.3, maybe not in 3.0 anymore)

bug: cudaMemcpyToSymbol working differently when in inline function and when in macro

bug: nvcc or VS issue? Inline function not included in partial compilation. No response at all :(

bug: Compile error with an array of objects. Not tested with CUDA 3.0

bug: Unnecessary local memory usage with int4 type. Not much response…

driver? bug: Inefficient block scheduling. Ridiculously off-topic response from tmurray :(

bug: Incorrect assignment to global variable in a small conditional branch

compiler or hardware bug: Branch that is never taken causes __syncthreads() de-synchronisation. Not much response :(

strange behaviour: “White gaps”, the GPU staying idle for no reason. It seems the number of variables has an impact on it.

problem: Using textures in several .cu files, even if each texture is being used only in one file

mild bug: Loop not unrolling. Manual unroll necessary

compiler bug: Variable visible before its declaration
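
For readers who have not seen the __CUDA_ARCH__ guarding mentioned in the list above, here is a minimal sketch of the general pattern. It is not the repro from the linked report, and the names square and device_only_square are made up for illustration. The #ifndef __CUDACC__ shim is my own addition so that the same file also builds with a plain C++ compiler; under nvcc it has no effect.

```cpp
// Sketch only: builds as CUDA C++ with nvcc, or as host-only C++ with a
// regular compiler. All function names here are hypothetical.
#include <cassert>

#ifndef __CUDACC__
// Not compiling with nvcc: let the CUDA qualifiers expand to nothing.
#define __host__
#define __device__
#endif

#ifdef __CUDACC__
// Device-only helper; it exists only when nvcc compiles the file.
__device__ int device_only_square(int x) { return x * x; }
#endif

// Callable from both host and device code.
__host__ __device__ inline int square(int x) {
#ifdef __CUDA_ARCH__
    // Device compilation pass: __CUDA_ARCH__ is defined here, so calling
    // the device-only helper is allowed.
    return device_only_square(x);
#else
    // Host compilation pass (or plain C++ build): ordinary fallback.
    return x * x;
#endif
}
```

On the host, square(3) goes through the #else branch; called from a kernel, it would go through the device-only helper instead. Going by its title, the report above is about nvcc 3.0 refusing to compile a call guarded like this, so treat the sketch as the intended usage rather than something guaranteed to build on that version.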

In my experience, every single bug I’ve opened via the online bug report site (not on the forums) was handled in a very professional way.

Most of the time NVIDIA’s QA wanted a simple repro, but once they had one, they checked it and did fix the problem in the very next version.

I got updates about the progress (going to QA, going to R&D, check in QA, closed/fixed in next version…) via emails.

Maybe you should try to open those bugs in their bug tracking system and not here in the forums.

my 1 cent


You are probably right… I thought both ways were more or less equal.