Bug appears only when compiling to "release". How to track it down?

Hi,

I have a global kernel that works fine when compiled to debug on CUDA 4.0/Ubuntu 10.10, but when compiled to release, it “crashes”, giving all-zero results. Execution continues normally after that, but with the bogus results carried forward. By commenting out various lines, I’ve traced the problem to two inlined device functions, one of which works fine when called by another global kernel, even in the release version. I tried manually replacing the inlined calls with the code from the functions, but that made no difference. I figured some compiler optimizations were causing my problem, so I turned them off by setting NVCCFLAGS to --compiler-options "-O0" in common.mk. Still no change. It’s down to that “black magic” stage, and I’d appreciate some advice from all you magicians out there. The kernel is pretty long, should I post it?

Thanks,

CRF

Can you try running the program in cuda-gdb and see what error is reported after the crash?

The program as a whole doesn’t crash, just one kernel, so the debugger reports a normal exit for the whole program. When I run the release version under cuda-gdb, the only diagnostic I get is the logging of kernels invoked. It looks like the kernel never runs, but that probably means it’s crashing before it sends the debugger a notification.

CRF

Kernels usually fail when they request too many resources. Check that the number of threads and blocks is within the limits and that you have enough registers to run the kernel with the specified number of threads per block.

Are you checking return codes from all CUDA function calls? What return codes do you get?
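A minimal sketch of what checking return codes around a kernel launch can look like. A kernel launch returns nothing itself, so a failed launch (for example, one requesting too many resources) only shows up through cudaGetLastError(), and a failure during execution shows up at the next synchronizing call. The CHECK macro and dummyKernel are hypothetical names used for illustration, not taken from the poster's code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print a readable message and abort on any CUDA error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real one.
__global__ void dummyKernel(int *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1;
}

int main()
{
    const int threads = 1024, blocks = 4;
    int *d_out = NULL;
    CHECK(cudaMalloc((void **)&d_out, threads * blocks * sizeof(int)));

    dummyKernel<<<blocks, threads>>>(d_out);
    CHECK(cudaGetLastError());       // errors from the launch itself
    CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CHECK(cudaFree(d_out));
    return 0;
}
```

If the launch is rejected, the kernel never runs and the output buffer keeps whatever it held before, which would be consistent with results that look "all zero" while later calls still succeed.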

This does seem to be a resource problem. I can run the release version when I reduce the number of threads per block. I’m not sure how the debug version works around this; it would be interesting to know. Anyway, thanks!

CRF

First, check the theoretical limits for launching kernels; second, count the registers per thread and multiply by the number of threads per block.
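The check described above can also be done programmatically. The sketch below queries the device limits and the compiled kernel's register count through the runtime API and multiplies them out. dummyKernel is a placeholder, and device 0 is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute the kernel under investigation.
__global__ void dummyKernel(int *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x;
}

int main()
{
    const int threadsPerBlock = 1024;  // block size you intend to launch

    cudaDeviceProp prop;
    cudaFuncAttributes attr;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess ||
        cudaFuncGetAttributes(&attr, dummyKernel) != cudaSuccess) {
        fprintf(stderr, "could not query device or kernel\n");
        return 1;
    }

    // Lower bound on registers per block; the hardware rounds allocations up.
    int regsNeeded = attr.numRegs * threadsPerBlock;

    printf("registers per thread: %d\n", attr.numRegs);
    printf("registers needed: %d (limit per block: %d)\n",
           regsNeeded, prop.regsPerBlock);
    printf("max threads per block for this kernel: %d (device limit: %d)\n",
           attr.maxThreadsPerBlock, prop.maxThreadsPerBlock);

    if (regsNeeded > prop.regsPerBlock ||
        threadsPerBlock > attr.maxThreadsPerBlock)
        printf("a launch with %d threads per block is expected to fail\n",
               threadsPerBlock);
    return 0;
}
```

As a worked example, with the 38 registers per thread reported later in this thread, 1024 threads would need at least 38 × 1024 = 38912 registers. On a device with 32768 registers per block (the compute capability 2.x figure, assumed here because the card is not named), that exceeds the limit, and the ceiling for 1024 threads would be 32 registers per thread.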

Yes, I did this, then reduced the number of threads and used a 2D grid of blocks to increase the number of blocks to compensate. Now it runs under release, but 3X slower than it did under debug with the larger number of threads per block. So now I’m left wondering how the debug version gets around these resource limitations. At this point I think I’d prefer to just use the debug version, because this routine is the rate-determining step of my simulation.

Any info on the debug version’s trick? Thanks,

CRF

Please post more details about your program.

How many registers do the debug and release versions of your code use? (Compile with -Xptxas -v to show them.) If your debug version uses fewer registers, you can use __launch_bounds__() to force the release version to use the same number of registers (check appendix B.17 of the Programming Guide).
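A short sketch of how the qualifier is applied. Note that __launch_bounds__() takes the maximum number of threads per block (and optionally a minimum number of blocks per multiprocessor), not a register count; the compiler derives the register limit from it. The kernel scaleValues and the value 1024 are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_THREADS_PER_BLOCK 1024  // assumed launch size

// The bound promises the compiler that this kernel is never launched with
// more than MAX_THREADS_PER_BLOCK threads per block, so it limits register
// use (spilling to local memory if necessary) to make such a launch fit.
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK)
scaleValues(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 4 * MAX_THREADS_PER_BLOCK;
    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess)
        return 1;
    cudaMemset(d_data, 0, n * sizeof(float));

    scaleValues<<<n / MAX_THREADS_PER_BLOCK, MAX_THREADS_PER_BLOCK>>>(d_data, n);
    cudaError_t err = cudaGetLastError();              // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Compiling with -Xptxas -v before and after adding the qualifier shows whether the register count changed. nvcc's -maxrregcount option is a coarser alternative that caps registers for every kernel in the file.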

I’m writing a sugarscape simulation, where millions of agents search for food on a million-square grid. Each agent has different capabilities and needs, so I assign one thread per agent and each selects a nearby square by comparing the values (to it) of squares visible to it. The agent-oriented servicing is necessary because the agents move, and it probably doesn’t allow much coalescing. Because in later steps (e.g. mating) agents need to know who their neighbors are, I keep a list of residents for each square. Updating this list for moves requires locking each agent’s old and new square. I’ve hit on a scheme that defers an agent to a secondary queue in case of lock conflicts; bouncing subsequent deferrals between two queues allows all agents who want to move to be serviced eventually. The number of lock conflicts seems to go up dramatically with even a reduction by a factor of two in number of threads per block; this seems to be the cause of the slowdown in my release-capable version.
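To make the deferral idea concrete, here is one way such a scheme could be sketched. This is an illustration under stated assumptions, not the poster's code: all names (moveAgents, tryLock, residents, and so on) are hypothetical, the resident list is reduced to a per-square count, and locks are taken in index order so that at least one contending agent succeeds in each pass. Threads never wait on a lock; an agent that fails to get both is appended to the other queue and retried in the next launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Try to take a per-square lock without waiting; true on success.
__device__ bool tryLock(int *lock) { return atomicCAS(lock, 0, 1) == 0; }

// Publish this thread's writes, then release the lock.
__device__ void unlock(int *lock) { __threadfence(); atomicExch(lock, 0); }

// One thread per queued agent; agents that lose a lock are deferred.
__global__ void moveAgents(const int *queue, int n,
                           int *agentSquare, const int *target,
                           int *locks, volatile int *residents,
                           int *deferred, int *numDeferred)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int a = queue[i];
    int from = agentSquare[a], to = target[a];
    if (from == to) return;                  // nothing to move

    int lo = min(from, to), hi = max(from, to);
    bool moved = false;
    if (tryLock(&locks[lo])) {
        if (tryLock(&locks[hi])) {
            residents[from] = residents[from] - 1;
            residents[to]   = residents[to] + 1;
            agentSquare[a]  = to;
            moved = true;
            unlock(&locks[hi]);
        }
        unlock(&locks[lo]);
    }
    if (!moved) deferred[atomicAdd(numDeferred, 1)] = a;
}

int main()
{
    const int numAgents = 8, numSquares = 4;
    // Every agent starts on square 0, so all of them contend for lock 0.
    int hSquare[numAgents], hTarget[numAgents], hQueue[numAgents];
    int hResidents[numSquares] = { numAgents, 0, 0, 0 };
    for (int a = 0; a < numAgents; ++a) {
        hSquare[a] = 0; hTarget[a] = 1 + a % 3; hQueue[a] = a;
    }

    int *dSquare, *dTarget, *dLocks, *dResidents, *dQueue[2], *dCount;
    cudaMalloc((void **)&dSquare, sizeof(hSquare));
    cudaMalloc((void **)&dTarget, sizeof(hTarget));
    cudaMalloc((void **)&dLocks, numSquares * sizeof(int));
    cudaMalloc((void **)&dResidents, sizeof(hResidents));
    cudaMalloc((void **)&dQueue[0], sizeof(hQueue));
    cudaMalloc((void **)&dQueue[1], sizeof(hQueue));
    cudaMalloc((void **)&dCount, sizeof(int));
    cudaMemcpy(dSquare, hSquare, sizeof(hSquare), cudaMemcpyHostToDevice);
    cudaMemcpy(dTarget, hTarget, sizeof(hTarget), cudaMemcpyHostToDevice);
    cudaMemcpy(dResidents, hResidents, sizeof(hResidents), cudaMemcpyHostToDevice);
    cudaMemcpy(dQueue[0], hQueue, sizeof(hQueue), cudaMemcpyHostToDevice);
    cudaMemset(dLocks, 0, numSquares * sizeof(int));

    // Bounce deferred agents between the two queues until none remain.
    int n = numAgents, cur = 0, passes = 0;
    while (n > 0) {
        cudaMemset(dCount, 0, sizeof(int));
        moveAgents<<<(n + 63) / 64, 64>>>(dQueue[cur], n, dSquare, dTarget,
                                          dLocks, dResidents,
                                          dQueue[1 - cur], dCount);
        if (cudaMemcpy(&n, dCount, sizeof(int), cudaMemcpyDeviceToHost)
            != cudaSuccess) return 1;        // stop if the kernel failed
        cur = 1 - cur;
        ++passes;
    }

    cudaMemcpy(hResidents, dResidents, sizeof(hResidents), cudaMemcpyDeviceToHost);
    printf("passes: %d, residents:", passes);
    for (int s = 0; s < numSquares; ++s) printf(" %d", hResidents[s]);
    printf("\n");
    return 0;
}
```

Because no thread ever spins on a lock, this pattern cannot deadlock inside a kernel; the cost is extra launches, and the number of passes grows with contention, which matches the slowdown described above when lock conflicts increase.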

I’ve checked that, running under debug, all agents are in fact being serviced in my version that exceeds the theoretical resource limits for my card. But I want to know why it works at all…

CRF

Tera, I can run the same code under debug or release (but the version of the kernel with excess threads crashed under release). When I reduced the number of threads per block by a factor of 2, the number of registers went from 37 to 38 - in other words, increased slightly - but stopped crashing. I also tried breaking the rather involved kernel in half, but the second half, where locking occurs, still crashed even though the number of registers decreased to 31. That was less, but not enough to eliminate resource overages without reducing the thread count.

CRF

Thanks for the suggestion! I tried limiting the number of registers using __launch_bounds__(). Here’s the weird thing: whatever the setting, from 8 registers to the 38 the compiler told me it chose before, now I can make the debug version crash like the release did. The debug version, unfettered by __launch_bounds__(), seems to be the only way the full 1024 threads of my card can be used. Bizarre! At least that tells me a crucial part of the difference between the debug and release compilations.

CRF

What is the answer to my first question: Are you always checking return codes from all CUDA function calls? And you always receive “no error”?

Do you have a display attached to the GPU?

Reading that your code involves locking, I’m pretty sure I know what happens: your locking code deadlocks, and the kernel launch times out. Whether the deadlock occurs or not can depend on any tiny detail (timing etc.), so it could just as well be triggered by the differences between debug and release builds. Locking is very difficult to get right on GPUs, and almost always the wrong solution to the problem.

Yes, from functions that return a status. My kernel returns void because it is a global. Subsequent CUDA function calls all return “no error”.

Not the one running the kernels.

Deadlock is not occurring (now). When compiled to debug, the kernel with 1024 threads per block runs successfully in less than 100 milliseconds. When compiled to release it crashes even faster. In previous versions I had deadlock (because I was waiting for locks, not deferring); the program would stop responding, just cranking away for minutes. Locking with deferrals is also working in other routines. With all due respect, if you read the rest of my description, you might find that locking is the only solution short of moving agents serially. If you have other suggestions for the problem of assigning and updating unique queue addresses for the agents on each square, I would be glad to hear them, but dogmatic responses are not useful.

CRF

I figure I missed an important sentence in your initial post, where you asked whether you should post the kernel:

Yes, please do so. While of course I can’t guarantee we’ll be able to find the problem, it will certainly lead to a more useful discussion than pure second-guessing.

Another question: Have you run the program under cuda-memcheck?