Heisenbug in CUDA 5.x?

I’m launching a single block of one warp and I’m seeing divergence where it should never happen.

It doesn’t happen with debug builds or when __forceinline__ is applied to all device functions.

I would like to file a bug report, of course, but isolating the problem has been a challenge because it moves or disappears whenever I corner it.

After trying other methods of locating where the problem was, I started using a snippet of code like this to trap when divergence appears:

__device__ void diverged(const unsigned int lineNum) // lineNum arg is __LINE__
{
   const unsigned int lanes = __ballot(true);

   if (lanes != 0xFFFFFFFF)
   {
       trapClock[laneId()] = clock();
       trapLines[laneId()] = lineNum;
   }
}

It reveals that there is divergence, but I still face the same problem: instrumenting the code with enough of these ballots winds up either fixing the bug or moving it to a new line number (obtained via __LINE__).

The short summary of the problematic routine is that it’s an infinite loop that is exited under certain conditions by returning a bool. It’s a fairly large amount of code.
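A hypothetical reduction of that control shape (not the actual routine — every name here is made up for illustration) looks something like this:

```cuda
// Hypothetical reduction of the control shape described above -- all names
// are made up for illustration.  An infinite loop that spins in some cases
// and exits by returning a bool from several distinct points.
enum { ITEM_EMPTY = 0, ITEM_DONE = 1, ITEM_ERROR = 2 };

__device__ unsigned int laneId() { return threadIdx.x & 31; }

__device__ bool processUntilDone(volatile unsigned int* queue)
{
    while (true)
    {
        const unsigned int item = queue[laneId()];

        if (item == ITEM_DONE)   return true;   // exit point 1
        if (item == ITEM_ERROR)  return false;  // exit point 2

        if (item == ITEM_EMPTY)  continue;      // nothing yet: spin

        queue[laneId()] = ITEM_EMPTY;           // consume and loop again
    }
}
```

Each return and the spin path is a separate branch out of the loop, which is the structure I suspect ptxas is mishandling.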

Anyway, I’m wondering if anyone else has hunted a bug like this and has any tips. :)

I want to assume that it’s an error in my code but… given the difficulty in tracking it down, I would not be surprised if it’s a compiler bug (a missing SSY?).

Again, it doesn’t appear that filing a bug is possible if I can’t isolate it. :|

Have you checked the SSYs in the code, or is the code too large to (reasonably) do that?

Are all threads exiting the loop at the same time or is there divergence that you expect to reconverge later on?

The code is too large to reasonably make sense of the SSYs. I’m not sure how I would begin, since I don’t have a full understanding of when the SSY opcode should appear. My assumption is that SSY marks the rejoin point for divergent code?

The expectation is that all threads enter and exit the loop together.
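(For completeness: one defensive way to guarantee a warp-uniform exit on these pre-Volta parts would be to vote on the exit condition with __all(), so the whole warp leaves together. A sketch, assuming a hypothetical per-lane work function:)

```cuda
// Sketch: force a warp-uniform loop exit with a vote (pre-Volta __all()).
// 'done' is each lane's private exit condition; the warp only leaves the
// loop when all 32 lanes agree, so no lane can diverge at the exit branch.
__device__ bool doOneStep();  // hypothetical per-lane work (declaration only)

__device__ void spinUntilWarpDone()
{
    bool done = false;

    while (true)
    {
        if (!done)
            done = doOneStep();  // returns true when this lane is finished

        if (__all(done))         // warp-wide vote: are all lanes done?
            break;
    }
}
```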

It works fine in debug, and it works fine if I pepper enough diverged() tests throughout my code.

Reducing the optimization level to ‘0’ doesn’t help.

cuda-memcheck has been helpful but is catching exceptions too late. The racecheck tool bleats a lot about some stylistic RaW and WaW warp-synchronous writes to shared memory but otherwise doesn’t reveal anything.

I don’t think it’s a debug vs. release memory initialization issue but… I hope it’s something as simple as that.

I’m just hoping that someone else can suggest some medieval but effective debugging techniques that might work!

That is my understanding of SSY as well. So there should be an SSY before every conditional branch that does not stem from a bra.uni.
I thought that you might look only at the SSY and BRA instructions and skip everything in between.
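One way to skim just that control-flow skeleton is to dump the SASS with cuobjdump and filter for those opcodes. A sketch — the cubin name is hypothetical, and the filter below runs against a small illustrative SASS snippet so the pipeline is self-contained:

```shell
# Keep only the control-flow skeleton (SSY / BRA / EXIT) of a SASS dump.
# Against a real binary you would feed in:
#   cuobjdump --dump-sass kernel.cubin     # kernel.cubin is hypothetical
# Here we pipe an illustrative snippet through the same filter instead.
sass='/*0040*/     SSY 0x90;
/*0048*/     ISETP.NE.AND P0, PT, R4, RZ, PT;
/*0050*/ @P0 BRA 0x80;
/*0058*/     LD.E R4, [R6];
/*0090*/     EXIT;'

printf '%s\n' "$sass" | grep -E 'SSY|BRA|EXIT'
```

With everything else stripped away, each conditional BRA should be preceded by an SSY naming its reconvergence point.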

Thanks @tera, I will try inspecting the SSY ops.

Good luck!

You can make racecheck suppress the messages about warp-level programming by using “--print-level error”. Could you post an example of the sort of exception cuda-memcheck is reporting?

Can you make a reproducer?

@vyas – I had already tried lowering racecheck’s error level and it didn’t lead me to any new conclusions. The memcheck failure I’m receiving is due to the divergence. :|

@vvolkov – I’m trying to, but it feels like it won’t be possible. I’ve shrunk the interesting parts of the routine to about 4000 SASS instructions, so I have a long way to go before it’s understandable. I’ve been inspecting the SSYs and there are just too many to evaluate unless I can make a micro example.

I’ve burned several days on this, so I’m thinking that the while loop { spin in some cases, or return a bool at 4 exit points } is giving ptxas trouble.

I have no choice but to make it work so… I’ll keep at it. :)

allanmac, if you have the code running on Windows you can run the Nsight VSE Instruction Count and Branch Statistics experiments, which collect per-SASS-line statistics and roll the data up to PTX and high-level source if you compiled with -G or -lineinfo. The Nsight VSE Analysis Source View will show the correlation from SASS to PTX to high-level source.

Ah, I forgot about the branch statistics. I will try it out!

“The memcheck failure I’m receiving is due to the divergence. :|” Can you explain that a bit more? I can’t think of an example where divergence would lead to illegal memory accesses.

Yes, that’s a good question because normally divergence and memory exceptions are unrelated topics. :)

In this case, prior divergence resulted in a custom memory allocator of just a few instructions not being executed because it was masked to run on lane 0. The other lanes arrived first and acquired a bad reference.
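To illustrate the failure mode, here is a hypothetical reduction of that kind of warp-synchronous allocator (not the actual code — names are made up):

```cuda
// Hypothetical reduction of the warp-synchronous allocator failure mode.
// Correct ONLY if the warp is converged on entry: lane 0 must write 'base'
// before the other lanes read it back.  If the warp has silently diverged
// earlier, the other lanes can arrive first and acquire a stale reference.
__shared__ unsigned int base;

__device__ unsigned int laneId() { return threadIdx.x & 31; }

__device__ unsigned int warpAlloc(unsigned int* bumpPtr, unsigned int nWords)
{
    if (laneId() == 0)
        base = atomicAdd(bumpPtr, nWords);  // lane 0 grabs a block...

    return base + laneId();                 // ...every lane reads it back
}
```

With the missing reconvergence, lanes 1–31 can execute the read while lane 0 is still off on its own path, so they return whatever was in 'base' from the previous allocation.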

Greg@NV, the Source-Level experiments followed by Source View is really powerful.

Still digging through it but having access to all the branch statistics is priceless.


Any luck on fixing this? I seem to have run into the same problem. Again, it seems to be related to calls to non-inline functions, and seems to happen when there’s complex code (e.g. doubly nested if-statements), and looks like it’s missing SSY instructions. It happens for me with ptxas versions 4.2 and 5.0.

Someone on stackoverflow has reproduced a simple example which appears to be the same problem, and will run on sm_21. Interestingly, it happens when there is only 1 thread:


No, I never had any luck fixing it through analysis.

I wound up burning at least a week or two of time. Eventually I threw my hands up and significantly altered the control structure of the code… and the bug disappeared.

It was a huge time suck.

It is indeed very frustrating, especially since it seems to be a bit random. For me, it was only happening on the 3rd iteration of an unrolled loop even though the 2nd and 3rd iterations were identical.

I was able to fix my own code by putting the contents of the loop inside this “no-op” if statement:

if(threadIdx.x<1024) { … }

It doesn’t get optimized away (CUDA 4.2), and it seems to force ptxas to emit the correct SSY instructions, which now clearly match up with the conditional BRAs. I’m curious whether this would fix your code as well.
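Applied to a loop of the shape discussed earlier, the workaround would look something like this (a hypothetical sketch; threadIdx.x < 1024 holds for any legal block size on these architectures, so the branch is a no-op at runtime):

```cuda
// Sketch of the workaround applied to a loop: wrap the body in a branch
// that is always taken (threadIdx.x < 1024 for any legal block size here),
// which appears to coax ptxas into emitting the missing SSYs.
// checkDone() and doWork() are hypothetical.
__device__ bool checkDone();
__device__ void doWork();

__device__ bool processUntilDone()
{
    while (true)
    {
        if (threadIdx.x < 1024)  // always true: purely a codegen nudge
        {
            if (checkDone())
                return true;

            doWork();
        }
    }
}
```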