How to abort a CUDA kernel stuck in an infinite loop?

I’m using Windows Server 2008 (Vista) and need a way to abort a CUDA call that has gone into an infinite loop. Pressing ctrl-C does nothing. Pressing ctrl-alt-delete to bring up the task manager hangs the machine.

Previously, I just waited 8 seconds for the Windows display timeout to abort the program. But after installing Nexus, I had to set TdrLevel = 0, which disables that timeout.

Please help

For starters, why is it in an infinite loop? Sounds like some bad programming if it’s in an infinite loop. Post code maybe?

I have a disjoint-set structure (union/find) that is very sensitive to concurrent merges. They can create a cycle in what should be a tree, so any code that tries to follow a path to the root loops endlessly.

My workaround for now is to run in emulation mode first. But emulation mode is going away in CUDA 3.1, so I need a way to abort in non-emulation mode.

You can use the TDR mechanism for this. Turn your timeout on and just set the timeout period to something long enough (say maybe 30 seconds) that you won’t hit it normally, only when you’re really in an infinite loop.

See the sticky post we just added on this at – although for you, you’d be ENABLING the timeout (which I assume you’ve already disabled or you wouldn’t be having this problem) rather than disabling it as the post discusses. :)


If you read my post carefully, I said I’m using the Nexus Beta debugger, which requires TdrLevel = 0 (a GPU in debug mode looks hung to the OS), so that doesn’t work.

What I would like to do is not extend the Windows desktop onto the Tesla, so that Windows won’t bother checking whether it’s hung. But when I do that, the Tesla is no longer listed by CUDA.

I’m using the GeForce driver, so maybe that’s the cause of the limitation. I haven’t tried the Tesla driver because I was afraid I would brick the OS if it doesn’t support the Quadro 295, the other card I use for the monitor. I know Vista’s WDDM version only allows one display driver to be loaded. Given my dilemma, I guess I’ll try it and use rollback if it doesn’t work.

I’m sorry, I did overlook that sentence in your original post.

Actually, I was just told late last week that for Vista and Win7, in fact the TDR timeout applies to all WDDM devices, even if they don’t have a display attached. I realize that the CUDA Toolkit release notes indicated the opposite; this was an error. (We just corrected the release notes in this regard this weekend.)

So clearly you can’t enable the timeout because you need to debug, as you’ve said. (Even if you hypothetically enabled the timeout and set it to a very long timeout period, that still wouldn’t help, because if you’re debugging, you might need an arbitrarily long time, and if you’re stuck in an infinite loop, you want to kick out of it as quickly as possible.)

Instead of using TDR, then, why not use a counter in your kernel that keeps track of how many data structure nodes you have traversed? Pick some threshold of iterations that is way beyond what should happen when everything is working, and if that number is reached, either exit completely or use a conditional breakpoint in the debugger to stop and see what’s going on. So it’s like a timeout, except that it only counts while the kernel is actively doing something.
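The iteration-counter idea can be sketched for a union/find root walk. This is only a minimal sketch, not code from this thread: `FindRootBounded`, `maxSteps`, the `-1` sentinel, and the convention that a root is its own parent are all assumptions made for illustration. The `__host__ __device__` stubs exist only so the same source also builds as ordinary C++.

```cpp
#include <cassert>

// Let the same source build as plain C++ or under nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Follows parent links from `node` up to its root, where a root satisfies
// parent[root] == root (an assumed convention). Returns -1 instead of
// spinning forever once the walk has followed more than maxSteps links.
__host__ __device__ inline int FindRootBounded(const int *parent, int node, int maxSteps)
{
    int steps = 0;
    while (parent[node] != node) {
        if (++steps > maxSteps) {
            return -1;  // probable cycle: bail out so the kernel can finish
        }
        node = parent[node];
    }
    return node;
}

// Host-side check: a healthy chain and a two-node cycle.
inline void CheckBoundedFind()
{
    int healthy[3] = {0, 0, 1};  // 2 -> 1 -> 0, and 0 is the root
    int cyclic[3]  = {0, 2, 1};  // 1 -> 2 -> 1, as a racy merge can produce
    assert(FindRootBounded(healthy, 2, 1000) == 0);
    assert(FindRootBounded(cyclic, 1, 1000) == -1);
}
```

In a kernel, a thread that gets `-1` back could set an error flag in global memory and return, or the `return -1` line could carry the breakpoint.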

Does that help?


Well, I’ve tried the Tesla driver and it does support Quadro 295, but I still need to extend the desktop. I thought you didn’t need to extend for the Tesla driver?

I also thought about a software timeout. Usually only while loops (which are quite rare) are suspect and need a timeout, but it still doesn’t seem practical.

One way I thought I could guard against infinite loops is to always run the code in Nexus debug mode first, before running it in non-debug mode, since the debugger should let you abort. But it seems even Nexus has problems aborting an infinite loop?

I’m not sure I quite follow you. Sure it makes sense about while loops being the problematic ones – because with for() loops, the conditions are usually more deterministic. But why is this not practical? There’s nothing stopping you from adding a loop counter inside a while() loop, even if it’s not used for anything but breaking out of the loop when some unreasonably high number of loop iterations is reached.

I’m not sure what happens in Nexus in this case – I’ll have to follow up with someone on the Nexus team.

Backing up a bit, I’m still trying to understand how you get into this bad case in the first place. Do you have multiple kernels operating on this data structure? Or is it just one kernel that is updating the structure and reading it, and it’s doing so in a way that is not thread-safe? Or is it in sysmem and the CPU is updating it while the GPU is reading it? I’m just trying to get a fuller understanding of what your implementation is doing.

I’m saying it’s not practical because:

I would need to constantly remember to instrument while loops, which isn’t going to happen. So I get at least one hang/reboot.

Next, I need to find which loop is hanging, and in my experience the hang doesn’t reproduce in emulation mode (a Heisenbug). So I potentially get another hang/reboot.

Given these problems, I simply can’t use Nexus.

The problem was identified and resolved in ~1 hour, but I run into these bugs often, and each one wastes a lot of time if I have to reboot.

Here’s a case where you generate a cycle in the union/find forest:


1	  2


I relax down the edge that divides the pixels labeled 1 and 2. A cycle results if one thread sets 1’s parent to 2 at the bottom 1/2 intersection while another sets 2’s parent to 1 at the top intersection.

There are multiple solutions to this problem:

  1. atomic operations - retry when result of atomicCAS isn’t what was expected

  2. always merge right to left, top to bottom

  3. merge blocks a stride at a time: merge (0, 1) (2, 3) (4, 5) (6, 7)… barrier() merge (0, 2) (4, 6) (8, 10)…
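Solutions 1 and 2 can be combined, and the following is a hedged host-side sketch of that combination rather than the code used in this thread. It uses `std::atomic<int>` so it runs as ordinary C++; in a kernel, `atomicCAS(&parent[ra], ra, rb) == ra` on a plain `int` array would play the role of `compare_exchange_strong`. The names `FindRoot` and `Union` are hypothetical, and linking the larger root index under the smaller one is an assumed stand-in for "always merge in one direction".

```cpp
#include <atomic>
#include <cassert>
#include <utility>

// Walks parent links to the root (parent[r] == r). Links only ever point
// from a larger index to a smaller one, so the walk must terminate.
inline int FindRoot(const std::atomic<int> *parent, int node)
{
    int p = parent[node].load();
    while (p != node) {
        node = p;
        p = parent[node].load();
    }
    return node;
}

// Merges the sets containing a and b; intended to be safe under concurrent calls.
inline void Union(std::atomic<int> *parent, int a, int b)
{
    assert(a >= 0 && b >= 0);
    for (;;) {
        int ra = FindRoot(parent, a);
        int rb = FindRoot(parent, b);
        if (ra == rb) return;            // already in the same set
        if (ra < rb) std::swap(ra, rb);  // fixed direction: larger root under smaller
        int expected = ra;
        // Link only if ra is still a root; another thread may have re-parented it.
        if (parent[ra].compare_exchange_strong(expected, rb)) return;
        // CAS failed: ra is no longer a root, so recompute both roots and retry.
    }
}
```

With this ordering, the two conflicting merges from the example (1 under 2, and 2 under 1) both resolve to "2 under 1", so whichever thread loses the race finds the roots already equal on its retry and stops.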

Fair enough. I was kind of under the impression that it was one particular loop that was frequently causing you trouble. (If you really want to use this technique, you could write a WHILE() macro to do the instrumentation work for you and get into the habit of using WHILE(). Just a thought. Shrug.)
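One possible shape for such a macro is sketched below. It is an assumption-laden illustration, not an established idiom: `WHILE_MAX_ITERATIONS`, `while_guard_`, and `FindRootGuarded` are hypothetical names, and the cap of one million is arbitrary. The macro expands to an ordinary `for` loop, so nothing in it is host-specific.

```cpp
#include <cassert>

// Hypothetical cap, far above any iteration count expected from correct code.
#ifndef WHILE_MAX_ITERATIONS
#define WHILE_MAX_ITERATIONS 1000000
#endif

// Drop-in replacement for `while (cond)`. The `for` owns a hidden counter,
// and the loop ends when cond is false or the counter reaches the cap.
// Each use declares its own counter, so nested WHILE loops do not interfere.
#define WHILE(cond) \
    for (int while_guard_ = 0; (cond) && while_guard_ < WHILE_MAX_ITERATIONS; ++while_guard_)

// Example: a root walk that cannot hang, even on a cyclic parent array.
inline int FindRootGuarded(const int *parent, int node)
{
    WHILE (parent[node] != node) {
        node = parent[node];
    }
    return node;  // a real root only if parent[node] == node still holds
}

// Host-side check using a two-node cycle.
inline void CheckGuardedWalk()
{
    int cyclic[3] = {0, 2, 1};             // 1 -> 2 -> 1
    int end = FindRootGuarded(cyclic, 1);  // returns instead of hanging
    assert(cyclic[end] != end);            // the walk never reached a real root
}
```

The trade-off is that the loop exits silently when the cap is hit, so the caller has to recheck the loop condition afterwards (as the check function does) to tell a normal exit from a forced one.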

This makes sense; threads are executed sequentially rather than in parallel when you’re in emulation mode.

But I’m a bit confused by this statement – you don’t need emulation mode in order for Nexus to work.

Are you saying that if you run a (non-emulation-mode) kernel under Nexus and try to hit the pause button while your kernel is in an infinite loop, it doesn’t pause? The engineer I talked to said that this should work.

Right. I’m probably drawing the wrong conclusions. I only used the Nexus debugger for a few days before going back to emulation mode and TdrLevel = 3 (still using CUDA 3.0).

Some of the problems I had:

  1. Debugger not stopping where I want it to. A single step sometimes jumps to a random place with no breakpoint instead of the next line

  2. Debugger can’t break out of an infinite loop (I only remember it happening once). It’s not fair to call this a problem because I have the bad habit of developing in Release mode.

  3. NVCC optimizer bug (turned out to also be in 2.3)

  4. NVCC pointer analysis changed: NVCC can’t tell whether the pointers in an array point to shared memory and assumes they point to global memory, which broke a lot of code (single pointers do work):

template <uint BATCH_SIZE>
void Reduce(float *pData[], uint nCount)



I was too frustrated, so I just threw up my hands and decided to wait until they released the next build. I’ll probably try debug mode again and learn the nuances.

Well, looks like I didn’t RTFM. !Readme_first_Nexus_Beta1_1.0.10013 says

From the Knowledge base article “How do I setup and use Nexus Beta 1’s experimental support for debugging CUDA C programs locally on a single machine with multiple GPUs?”

From the Nexus user guide:

As I envisioned, after not extending the desktop, the Nexus debugger is a lot less temperamental. I’ve made over a dozen debug runs and haven’t encountered a lockup yet.

Ah, that’s great news! Glad things are going a bit more smoothly for you now. :)