How to abort infinite loop CUDA kernel?

Uncle_Joe · February 9, 2010, 11:43pm

I’m using Windows Server 2008 (Vista) and need a way to abort a CUDA call that has gone into an infinite loop. Pressing ctrl-C does nothing. Pressing ctrl-alt-delete to bring up the task manager hangs the machine.

Previously, I just waited 8 seconds for the Windows display timeout to abort the program. But after installing Nexus, I had to set TdrLevel = 0.

Please help

scwizzo · February 11, 2010, 9:38pm

For starters, why is it in an infinite loop? Sounds like some bad programming if it’s in an infinite loop. Post code maybe?

Uncle_Joe · February 11, 2010, 9:56pm

I have a disjoint set structure (union/find) which is very sensitive to concurrent merges. These can generate a cycle in what should be a tree, so any code that tries to follow a path to the root loops endlessly.

My workaround for now is to run in emulation mode first. But emulation mode is going away in CUDA 3.1, so I need a way to abort in non-emulation mode.

Cliff_Woolley · February 19, 2010, 11:24pm

You can use the TDR mechanism for this. Turn your timeout on and just set the timeout period to something long enough (say maybe 30 seconds) that you won’t hit it normally, only when you’re really in an infinite loop.

See the sticky post we just added on this at http://forums.nvidia.com/index.php?showtopic=160277 – although for you, you’d be ENABLING the timeout (which I assume you’ve already disabled or you wouldn’t be having this problem) rather than disabling it as the post discusses. :)

–Cliff

Uncle_Joe · February 19, 2010, 11:35pm

If you read my post carefully, I said I’m using the Nexus Beta Debugger, which requires TdrLevel = 0 (a GPU in debug mode looks hung to the OS), so that doesn’t work.

What I would like to do is not extend the Windows desktop onto the Tesla, so that Windows won’t bother checking whether it’s hung. But when I do that, the Tesla is no longer listed by CUDA.

I’m using the GeForce driver, so maybe that’s the cause of the limitation. I haven’t tried the Tesla driver because I was afraid I would brick the OS if the Tesla driver doesn’t support the Quadro 295 I use for the monitor. I know Vista’s version of WDDM allows only one display driver to be loaded. Given my dilemma, I guess I’ll try it and roll back if it doesn’t work.

Cliff_Woolley · February 23, 2010, 5:35pm

I’m sorry, I did overlook that sentence in your original post.

Actually, I was just told late last week that for Vista and Win7, in fact the TDR timeout applies to all WDDM devices, even if they don’t have a display attached. I realize that the CUDA Toolkit release notes indicated the opposite; this was an error. (We just corrected the release notes in this regard this weekend.)

So clearly you can’t enable the timeout because you need to debug, as you’ve said. (Even if you hypothetically enabled the timeout and set it to a very long timeout period, that still wouldn’t help, because if you’re debugging, you might need an arbitrarily long time, and if you’re stuck in an infinite loop, you want to kick out of it as quickly as possible.)

Instead of using TDR, then, why can’t you use a counter in your kernel that keeps track of how many data structure nodes you have traversed? Pick some threshold of iterations that is way beyond what should happen when everything is working, and if that number is reached, either exit completely or use a conditional breakpoint in the debugger to stop and see what’s going on. So it’s like a timeout, except that it only counts when the kernel is actually doing something.
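A minimal sketch of the counter idea, assuming the forest is stored as an array of parent indices in which a root is its own parent (the thread does not show the actual data layout). The names `find_root_bounded`, `max_steps` and `kCycleSuspected` are hypothetical. The `HOST_DEVICE` macro only lets the same function be compiled by nvcc for the GPU or by an ordinary C++ compiler for host-side testing.

```cpp
// Compiles under nvcc (as device code) or as plain C++ on the host.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical sentinel returned when the step limit is reached.
constexpr int kCycleSuspected = -1;

// Follow parent links from `node` until a root (parent[x] == x) is found.
// Gives up after `max_steps` hops instead of spinning forever on a cycle.
HOST_DEVICE inline int find_root_bounded(const int* parent, int node,
                                         int max_steps) {
    for (int steps = 0; steps < max_steps; ++steps) {
        int next = parent[node];
        if (next == node) {
            return node;        // reached the root
        }
        node = next;
    }
    return kCycleSuspected;     // a natural spot for a conditional breakpoint
}
```

A kernel calling this could write a flag to global memory and return when it sees `kCycleSuspected`, so the host can report the problem after the launch instead of waiting on a hung GPU. The limit should be well above the longest legitimate path, for example the number of nodes.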

Does that help?

–Cliff

Uncle_Joe · February 23, 2010, 5:54pm

Well, I’ve tried the Tesla driver and it does support Quadro 295, but I still need to extend the desktop. I thought you didn’t need to extend for the Tesla driver?

I also thought about a software timeout. Usually only while loops (which are quite rare) are suspect and need a timeout, but it still doesn’t seem practical.

One way I thought I could guard against infinite loops is to always run the code in Nexus debug mode first, before running it in non-debug mode, since the debugger should let you abort. But it seems even Nexus has problems aborting an infinite loop?

Cliff_Woolley · February 24, 2010, 10:13pm

I’m not sure I quite follow you. Sure it makes sense about while loops being the problematic ones – because with for() loops, the conditions are usually more deterministic. But why is this not practical? There’s nothing stopping you from adding a loop counter inside a while() loop, even if it’s not used for anything but breaking out of the loop when some unreasonably high number of loop iterations is reached.

I’m not sure what happens in Nexus in this case – I’ll have to follow up with someone on the Nexus team.

Cliff_Woolley · February 24, 2010, 10:21pm

Backing up a bit, I’m still trying to understand how you get into this bad case in the first place. Do you have multiple kernels operating on this data structure? Or is it just one kernel that is updating the structure and reading it, and it’s doing so in a way that is not thread-safe? Or is it in sysmem and the CPU is updating it while the GPU is reading it? I’m just trying to get a fuller understanding of what your implementation is doing.

Uncle_Joe · February 24, 2010, 10:27pm

I’m saying it’s not practical because:

I would need to constantly remember to instrument while loops, which isn’t going to happen. So I get at least 1 hang/reboot.

Next, I need to find which loop is hanging, and in my experience the hang doesn’t reproduce in emulation mode (Heisenbug). So I potentially get another hang/reboot.

Given these problems, I simply can’t use Nexus.

Uncle_Joe · February 24, 2010, 10:39pm

The problem was identified and resolved in ~1 hour, but I run into these bugs often, which wastes a lot of time if I have to reboot each time.

Here’s a case where you generate a cycle in the union/find forest:

I relax down the edge that divides pixels labeled 1 & 2. A cycle will result if, at the bottom intersection of 1 and 2, I set 1’s parent to 2 while, at the top intersection, I set 2’s parent to 1.

There are multiple solutions to this problem:

- atomic operations: retry when the result of atomicCAS isn’t what was expected
- always merge right to left, top to bottom
- merge blocks a stride at a time: merge (0, 1) (2, 3) (4, 5) (6, 7)… barrier() merge (0, 2) (4, 6) (8, 10)…
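The first option could look roughly like the sketch below. It assumes a parent-index array in which a root is its own parent, and it additionally always links the larger-indexed root under the smaller one. That fixed direction is an assumption of this sketch, not something stated in the thread. It means parent links only ever point to a smaller or equal index, so no cycle can form. The helper names `find_root` and `unite` are hypothetical. `atomicCAS(int*, int, int)` is the CUDA intrinsic, and the host-only stand-in exists purely so the sketch can be compiled and tested without a GPU.

```cpp
#ifdef __CUDACC__
#define DEVICE __device__
#else
#define DEVICE
// Host-only stand-in with the same return convention as CUDA's atomicCAS
// (it returns the old value). Single-threaded and NOT atomic; testing only.
static int atomicCAS(int* address, int compare, int val) {
    int old = *address;
    if (old == compare) {
        *address = val;
    }
    return old;
}
#endif

// Follow parent links until a root (parent[x] == x) is reached.
DEVICE inline int find_root(const int* parent, int node) {
    while (parent[node] != node) {
        node = parent[node];
    }
    return node;
}

// Merge the sets containing a and b, retrying if another thread interferes.
DEVICE inline void unite(int* parent, int a, int b) {
    for (;;) {
        a = find_root(parent, a);
        b = find_root(parent, b);
        if (a == b) {
            return;                          // already in the same set
        }
        if (a < b) {                         // make `a` the larger root
            int tmp = a; a = b; b = tmp;
        }
        // Link a under b only if a is still a root at this moment.
        if (atomicCAS(&parent[a], a, b) == a) {
            return;
        }
        // Someone re-parented a first: look up the new roots and try again.
    }
}
```

With this rule, the two conflicting merges from the example (1 under 2 at one intersection, 2 under 1 at the other) both resolve to linking 2 under 1. How promptly one thread sees another thread’s writes to `parent` on a real device depends on the memory model, so treat this as a starting point rather than a verified implementation.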

Cliff_Woolley · February 25, 2010, 1:02am

Fair enough. I was kind of under the impression that it was one particular loop that was frequently causing you trouble. (If you really want to use this technique, you could write a WHILE() macro to do the instrumentation work for you and get into the habit of using WHILE(). Just a thought. Shrug.)
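One possible shape for such a macro, as a sketch only: `WHILE_BOUNDED`, its parameters, and `find_root_guarded` are hypothetical names, and the parent-array layout (a root is its own parent) is assumed. The macro expands to a `for` loop that behaves like `while (cond)` but stops after `limit` iterations and sets `flag` to 1 when it does.

```cpp
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Like while (cond), but gives up after `limit` iterations.
// `flag` must be an int lvalue; it is set to 1 only when the limit is hit.
#define WHILE_BOUNDED(cond, limit, flag)                                  \
    for (int wb_iter_ = 0;                                                \
         (cond) && ((wb_iter_ < (limit)) || ((flag) = 1, false));         \
         ++wb_iter_)

// Example use: root lookup that reports a suspected cycle through *hung.
HOST_DEVICE inline int find_root_guarded(const int* parent, int node,
                                         int* hung) {
    WHILE_BOUNDED(parent[node] != node, 1 << 20, *hung) {
        node = parent[node];
    }
    return node;
}
```

Because the counter lives in the `for` header, `break` and `continue` inside the body still work as they would in an ordinary `while` loop. When `*hung` comes back as 1, the returned node is just wherever the walk stopped, not a root.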

It makes sense that the hang doesn’t show up in emulation mode; threads are executed sequentially rather than in parallel when you’re in emulation mode.

But I’m a bit confused by your statement that you can’t use Nexus: you don’t need emulation mode in order for Nexus to work.

Are you saying that if you run a (non-emulation-mode) kernel under Nexus and try to hit the pause button while your kernel is in an infinite loop, it doesn’t pause? The engineer I talked to said that this should work.

Uncle_Joe · February 25, 2010, 2:26am

Right. I’m probably drawing the wrong conclusions. I only used the Nexus debugger for a few days before going back to emulation mode & TdrLevel = 3 (still using CUDA 3.0).

Some of the problems I had:

- Debugger not stopping where I want it. Single step sometimes jumps to a random place with no breakpoint instead of the next line.
- Debugger can’t break out of an infinite loop (I only remember this happening once). It’s not fair to call this a problem, because I have the bad habit of developing in Release mode.
- NVCC optimizer bug (turned out to also be in 2.3).
- NVCC pointer analysis changed: NVCC can’t tell whether pointers in an array point to shared memory and assumes they point to global memory, which broke a lot of code (single pointers do work):

template <uint BATCH_SIZE>
void Reduce(float *pData[], uint nCount)
{
}

I was too frustrated, so I just threw up my hands and decided to wait until they released the next build. I’ll probably try debug mode again and learn the nuances.

Uncle_Joe · February 26, 2010, 5:27pm

Well, looks like I didn’t RTFM. !Readme_first_Nexus_Beta1_1.0.10013 says

From the Knowledge base article “How do I setup and use Nexus Beta 1’s experimental support for debugging CUDA C programs locally on a single machine with multiple GPUs?”

From the Nexus user guide:

As I envisioned, after not extending the desktop, the Nexus debugger is a lot less temperamental. I’ve made over a dozen debug runs and haven’t encountered a lockup yet.

Cliff_Woolley · February 26, 2010, 5:45pm

Ah, that’s great news! Glad things are going a bit more smoothly for you now. :)
