The CUDA 5-second execution-time limit: finding a way to work around the GDI timeout

I’m finding it pretty debilitating that our desktop supercomputers are limited to a mere 5 seconds of kernel execution time. I’m used to hours, days or even weeks of processing on certain kinds of scientific/engineering problems. Has anybody managed to work around this rather severe limitation?

I’ve been reading in about half a dozen forums that the limitation boils down to some trivial Windows display-driver timeout. The semi-solution of buying a second GPU for dedicated processing is hardly a solution for most people - not least because most systems (including mine) have but one PCIe x16 slot. I also find it very wasteful to have half the resources sitting there doing nothing except painting windows. Linux is not an option either: some of the software I need to hook this up to only runs on Windows.

And it’s a real shame that I can’t at least have that tiny on-board ATI GPU drive the display while dedicating my new Nvidia GPU to the stuff that matters!

I also read that Windows XP SP1 has a means of altering this time-out by setting a large BreakPointDelay value in the registry key:


However, this does not seem to have any effect on my WinXP SP3 installation.
Any ideas? Has nobody tried hacking and butchering the Windows GDI yet? There’s got to be a way around this pesky timeout!

I also tried breaking up the kernel code into multiple shorter pieces, but this is not working either, since it appears that the TOTAL execution time is what matters in the end. A series of 20 separate 1-second kernel calls (with different parameters) is merely treated as if it were one 20-second call :-/

I spent all day hacking at this issue and I’m at a loss… and I’d hate to have to give it up so quickly.

Anyone has any ideas? Experiences? Solutions?

If you absolutely can’t find a way to break up your kernels (perhaps try some different things, and ask around here on the forums if you want some feedback), the best solution is really to just run Linux. If you run without a window manager on Linux (i.e. from the command line), you can run kernels for as long as you like.

You can disable the watchdog timer on Windows if you like, but basically, due to the underlying driver structure (of Windows, not CUDA), after a few seconds the display will freeze until the kernel has finished executing (however long that takes). If you really need to run Windows and you can have a dedicated compute box, perhaps that is a possible solution for you. If you’re planning to distribute your software for commercial purposes (sell it), you should know that Microsoft basically forbids anyone from disabling the watchdog in commercial software.

Most kernels that are able to be accelerated by CUDA are also able to be broken up into smaller bits; think about Folding@Home, Badaboom, and so forth…even though their tasks might take minutes or hours, they have still found ways to break their kernels up into individual bits so as not to run afoul of the watchdog timer. Like I said, if you’re new to CUDA, you should ask around the forum for some advice – sometimes there are ways to optimize GPU code that may not be obvious unless you have had to figure them out before ;)
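As an illustrative sketch of what that breaking-up can look like (the names `processChunk` and `runInChunks`, and the chunk size, are hypothetical, not taken from any of those applications): instead of looping over all the work inside one launch, pass a slice of the iteration range to each launch and synchronize in between, so no single launch approaches the watchdog limit.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: performs only iterations [start, end) of the total work.
__global__ void processChunk(const int* in, int* out, int n, int start, int end)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    int acc = out[idx];                  // assumes out was zero-initialized
    for (int i = start; i < end; ++i)    // only a slice of the full loop
        acc += in[idx];
    out[idx] = acc;
}

// Host side: issue many short launches instead of one long one.
void runInChunks(const int* in_d, int* out_d, int n, int total, int chunk)
{
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    for (int start = 0; start < total; start += chunk) {
        int end = (start + chunk < total) ? (start + chunk) : total;
        processChunk<<<grid, block>>>(in_d, out_d, n, start, end);
        cudaThreadSynchronize();   // each launch completes well under the watchdog limit
    }
}
```

The key design point is that the host waits for each chunk to finish before enqueuing the next one, so the driver sees a sequence of short, completed kernels rather than one long-running one.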

This statement is incorrect. You can run as many executions of the kernel as you like, as long as each kernel call takes less than 5 seconds.



If only it were as simple as you say!

Now try the following code example:

[codebox]#include <stdio.h>
#include <cutil_inline.h>

// BreakPointDelay set to 40960 seconds = 11.37 hours

#define BLOCK_SIZE 256   // Thread block size
#define REPEATS 300000   // Loop iterations per kernel call

__global__ void myKernel(int* INPUTpointer, int* OUTPUTpointer, int N, int Cin)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;	// Calculate a linear index
	for (int count = 0; count < REPEATS; count++)		// Spending a long time here
	{
		if (idx < N) OUTPUTpointer[idx] = INPUTpointer[idx] + Cin;
	}
}

struct GPUTimingsSTRUCT
{
	float TotalTime;
	float MemcopyCPU2GPU;
	float MemcopyGPU2CPU;
	float GPU_KERNEL_1;
	float GPU_KERNEL_ALL;
};

int main()
{
	int N = 65536;
	int* DATA1_d;	// input vector of integers in device memory
	int* DATA2_d;	// output vector of integers in device memory
	int* DATA1_h;	// input vector of integers in host memory
	int* DATA2_h;	// output vector of integers in host memory
	GPUTimingsSTRUCT GPUTimings;

	// Define execution configuration
	dim3 dimBlock(BLOCK_SIZE, 1, 1);
	dim3 dimGrid(256, 1, 1);
	size_t dynShared = 0;

	cudaEvent_t Event1, Event2, Event3, Event4, Event5;
	cudaEventCreate(&Event1); cudaEventCreate(&Event2); cudaEventCreate(&Event3);
	cudaEventCreate(&Event4); cudaEventCreate(&Event5);

	// Allocate vectors in device memory
	cutilSafeCallNoSync(cudaMalloc((void**)&DATA1_d, N*sizeof(int)));
	cutilSafeCallNoSync(cudaMalloc((void**)&DATA2_d, N*sizeof(int)));

	// Allocate vectors in host memory
	DATA1_h = (int*)malloc(N*sizeof(int));
	DATA2_h = (int*)malloc(N*sizeof(int));

	// Copy vectors from host memory to device memory
	cudaEventRecord(Event1, 0); cudaEventSynchronize(Event1);
	cutilSafeCallNoSync(cudaMemcpy(DATA1_d, DATA1_h, N*sizeof(int), cudaMemcpyHostToDevice));
	cudaEventRecord(Event2, 0); cudaEventSynchronize(Event2);

	// Invoke broken-up kernels (note: the launch configuration is <<<grid, block>>>)
	myKernel<<<dimGrid, dimBlock, dynShared, 0>>>(DATA1_d, DATA2_d, N, 0);
	cudaEventRecord(Event3, 0); cudaEventSynchronize(Event3);
	printf("STEP 1 ok!\n");
	myKernel<<<dimGrid, dimBlock, dynShared, 0>>>(DATA1_d, DATA2_d, N, 1);
	printf("STEP 2 ok!\n");
	myKernel<<<dimGrid, dimBlock, dynShared, 0>>>(DATA1_d, DATA2_d, N, 2);
	printf("STEP 3 ok!\n");
	// Uncomment more of these to push the total past the watchdog limit:
	//myKernel<<<dimGrid, dimBlock, dynShared, 0>>>(DATA1_d, DATA2_d, N, 3);
	//printf("STEP 4 ok!\n");
	// ... (STEP 5 through STEP 19 are identical except for the Cin argument) ...
	//myKernel<<<dimGrid, dimBlock, dynShared, 0>>>(DATA1_d, DATA2_d, N, 19);
	//printf("STEP 20 ok!\n");
	cudaEventRecord(Event4, 0); cudaEventSynchronize(Event4);

	// Copy vectors from device memory to host memory
	cutilSafeCallNoSync(cudaMemcpy(DATA2_h, DATA2_d, N*sizeof(int), cudaMemcpyDeviceToHost));
	cudaEventRecord(Event5, 0); cudaEventSynchronize(Event5);

	cudaEventElapsedTime(&GPUTimings.TotalTime, Event1, Event5);
	cudaEventElapsedTime(&GPUTimings.MemcopyCPU2GPU, Event1, Event2);
	cudaEventElapsedTime(&GPUTimings.GPU_KERNEL_1, Event2, Event3);
	cudaEventElapsedTime(&GPUTimings.GPU_KERNEL_ALL, Event3, Event4);
	cudaEventElapsedTime(&GPUTimings.MemcopyGPU2CPU, Event4, Event5);

	printf("GPUTimings.TotalTime        = %f\n", GPUTimings.TotalTime);
	printf("GPUTimings.Memcopy CPU2GPU  = %f\n", GPUTimings.MemcopyCPU2GPU);
	printf("GPUTimings.GPU_KERNEL_1     = %f\n", GPUTimings.GPU_KERNEL_1);
	printf("GPUTimings.GPU_KERNEL_ALL   = %f\n", GPUTimings.GPU_KERNEL_ALL);
	printf("GPUTimings.Memcopy GPU2CPU  = %f\n", GPUTimings.MemcopyGPU2CPU);
	printf("GPU Additions /Sec = %f\n", (1000.0*(float)REPEATS*(float)N)/(GPUTimings.GPU_KERNEL_1));

	// cleanup
	cudaEventDestroy(Event1); cudaEventDestroy(Event2); cudaEventDestroy(Event3);
	cudaEventDestroy(Event4); cudaEventDestroy(Event5);
	cudaFree(DATA1_d); cudaFree(DATA2_d);
	free(DATA1_h); free(DATA2_h);

	return 0;
}[/codebox]

It works fine as I pasted it… However, try uncommenting more of the kernel calls, and soon enough, you will run into problems!

And it actually still stops in about 5 seconds no matter how many kernel calls you ask it to do.

This is the output I get:


STEP 1 ok!

STEP 2 ok!

STEP 3 ok!

STEP 4 ok!

STEP 5 ok!

STEP 6 ok!

STEP 7 ok!

STEP 8 ok!

STEP 9 ok!

STEP 10 ok!

STEP 11 ok!

STEP 12 ok!

STEP 13 ok!

STEP 14 ok!

STEP 15 ok!

STEP 16 ok!

STEP 17 ok!

STEP 18 ok!

STEP 19 ok!

STEP 20 ok!

cudaSafeCallNoSync() Runtime API error in file ‘’ in line 131 : the launch timed out and was terminated. …"

The error is exactly the same as when I try to run a 100 second Kernel!

So again I ask… What is the problem here?

As much as I would love to, unfortunately, Linux is not on my list due to other considerations (other software packages that don’t run on linux).

Although it’s ugly, I can live with the screen freezing during computation runs. At the moment, getting this thing to work is way more important than distributing my software. So if there IS a way of killing that GDI watchdog, I’d be the first to want to know how to murder it :-P… Never mind the display or Microsoft’s paranoia. So the question remains: has anybody ever managed to do this before?

I agree with you, and I’m sure it’s a solved problem; otherwise everybody would have packed up and left the GPGPU business!

However, I’m fairly new, and I’ve now got to the point where I want to stop playing with toy examples and begin doing some serious stuff with it. But I fail to understand how it works for larger problems, because when I try it, CUDA beats me by lumping together all my 1-second kernel calls…

Have a look at the code example I posted in the other fork of this thread!

I would be glad if anybody had a few hints…

@marcnet: AFAIK, multiple kernels, each running within the 5-second watchdog, should be fine. I note that you are experiencing problems with this; I will try the code that you have posted and get back to you. But FYI, my app can run multiple kernels which together run for more than 5 seconds.

At a quick glance, try putting a cudaThreadSynchronize() or cudaGetLastError() after each kernel call.
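For instance, here is a minimal sketch of that per-launch check, applied to the myKernel calls from the code posted above (the error-handling style is just one option):

```cuda
// Launch, then force completion and inspect the error state before moving on.
myKernel<<<dimGrid, dimBlock, dynShared, 0>>>(DATA1_d, DATA2_d, N, 0);
cudaThreadSynchronize();                 // block until this kernel has finished
cudaError_t err = cudaGetLastError();    // pick up any launch/execution failure
if (err != cudaSuccess)
    printf("STEP 1 failed: %s\n", cudaGetErrorString(err));
else
    printf("STEP 1 ok!\n");
```

Because the check runs after an explicit synchronize, a reported error can be attributed to this specific kernel rather than to some earlier asynchronous launch.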

As long as you are running XP, you can use the Tesla cards, which should not have a watchdog limitation since they are not ‘display’ cards. That is my solution to the problem. But note that under Vista, even the Tesla bumps into the watchdog timer; in fact, Vista has a 2-second watchdog timer. Isn’t that cool?

Hopefully that problem will be fixed under Windows 7. It’s got a newer (obviously) display driver model than Vista, which allows each card in the machine to have its own device. So technically, you could turn off the watchdog for a Tesla and not have it affect your other display cards. Or, you should be able to run an nVidia card and an ATI card in the same machine, which would be awesome for testing stuff like OpenCL.

What happens if you lower your “REPEATS” value to something like 1000? Having it set to 300000 is probably way too much work to be doing in one kernel, since you’re reading from global memory each time (which is fairly slow, compared to the other memory available on the device).

The example given requires an effective device bandwidth of at least 32GB/s to have a fighting chance of finishing a single call within 5 seconds. Do you have that?

Otherwise CUDA will just asynchronously queue up as many kernel calls as you’d like (hence the printf statements you see) but still time out on the first call after 5 seconds.

Too much work? What do you mean? The only reason for doing GPGPU computing is to do a lot of work, no?

After all, I’m not reading new data in the simple test case. I’m just reading the same data over and over again. So if it works for one cycle, it should work for any other number, because nothing at all is changing from one cycle to the next!

Moreover, all the calls to global memory are uniform and properly coalesced and in total I am managing an average of about 5.6Gflops of sustained calculations including transfers (when it works!).

To tell you the whole truth, I was hoping to run several trillion cycles (and more), let alone 300,000 perfectly identical (incredibly simple) calls.

In fact I already lowered the REPEATS value from 3,000,000 to 300,000 to fall within the 5 second limit and then to compensate I increased the number of calls 10 fold.

So in any case, if I lower the REPEATS value further to 1000, I would have to compensate by increasing the number of calls 300-fold, which is most inefficient and still leaves me no better off, as I have already explained in my previous post.

Launches are asynchronous, so you’re probably timing out on the first kernel you launch. Eventually it fails (long after you enqueue the rest of your kernels), so it only seems like the last kernel is failing.
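To see that asynchrony directly, here is a tiny sketch (reusing the kernel and launch configuration from the code posted above); the comments describe the expected behaviour rather than guaranteed timings:

```cuda
// A kernel launch returns control to the host almost immediately;
// the host only blocks when it explicitly synchronizes.
myKernel<<<dimGrid, dimBlock, dynShared, 0>>>(DATA1_d, DATA2_d, N, 0);
printf("launch returned\n");   // prints right away, while the kernel may still be running
cudaThreadSynchronize();       // the host waits here until the kernel has finished
printf("kernel actually finished\n");
```

This is why all 20 "STEP n ok!" messages can appear before the first kernel has even hit the watchdog: the printf only proves the launch was enqueued, not that it completed.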

Yes I understand that CUDA is simply asynchronously queuing up as many kernel calls as I issue but the first calls work well!

So much so that a single call (300,000 loops long) works just fine and takes 3458.93 ms — well within the 5 seconds!

In fact BandwidthTest.exe quotes a device bandwidth of 41,493.5MB per second… which as you say, is more than enough!

Yes I understand that CUDA launches are asynchronous. This is my problem in fact and my code snippet demonstrates it.

This is why 1 huge multi-minute call and lots of small sub-5second calls are equivalent in terms of the 5 second limit.

A single call (300,000 loops long) works great and takes 3458.93 ms (on my GPU) — which is well within the 5 seconds!

It’s a sequence of such calls that fails miserably!

So how on earth can we get around this vexing problem?

Well that offers a ray of hope… but in the meantime it leaves us mortals (running WinXP and Vista) out in the cold :unsure:

Yes, I know why we want to use CUDA. I was just asking you to try a lower number so that you could rule out the possibility that the kernel was taking too long (for some reason) and not some other error.

Preface: I know exactly what I’m talking about with regards to watchdog timers because I broke them many times and found out what the hardware and driver are actually doing to implement them.

The watchdog timer only affects you on RM-based OSes (which is to say, not Vista and OS X) if you have a single kernel invocation that takes more than 5s of execution time and there is a display connected*. There is no “cumulative watchdog timer” or however you want to describe it because of how the GPU performs context switching. Once a kernel call has been completed, the GPU is free to respond to the driver which will reset the watchdog timer countdown. As a result, there’s no upper bound on the total amount of time spent in CUDA kernels per process.

All your code snippet demonstrates is that some kernel you launch fails; without a cudaThreadSynchronize() after each, there is no way to determine which one fails. Because launches are asynchronous, they will be enqueued, so there is not really any meaningful penalty to doing 3000 kernel launches** versus 300 versus 30 versus 3***.

* if you boot directly to a console in Linux without ever running X, even the GPU used for console output will not have a watchdog timer

** except for the cost of a kernel launch, which on RM platforms is ~8 µs; time spent in the kernel will dominate for any meaningful amount of work

*** it is possible to fill the queue, which will implicitly cause a cudaThreadSynchronize(), so keep that in mind if you’re trying to launch a bazillion kernels and then do CPU-side work

Yes, I understand your point, and I appreciate the importance of trying it out. So I actually tried it, and the short answer is that the problem persists.

One short call works well in isolation. But then it fails when it is stacked up with other calls that collectively exceed 5 seconds.

So what happens if you run, say, the nbody sample in the SDK and leave it running for twenty minutes?

Kudos to you, my friend (no pun intended!). You’ve just solved the problem…

A cudaThreadSynchronize() between each call solves the problem beautifully!

I guess working in 5-second batches, although a little inefficient in terms of overheads, is not too bad after all.

So Thanks a Million!

…Pity this crucial trick was never mentioned anywhere in the documentation…

Many thanks for the detailed expert answer - much appreciated!

I just tried inserting a cudaThreadSynchronize() between each call as you say, and voilà! It solves the problem beautifully (apparently!)

Each kernel call proceeds one after another and they all finish off gracefully giving correct results.

Only this time, instead of shooting all the printf outputs at once and timing out…

… I get one printf output 3.5 seconds apart (as would be expected due to the 3.5s intervals of kernel processing)

So total processing time = 3.5*20 seconds. (at last!)

But as you see, none of them was actually hanging. They were all valid kernel calls, so it appears to me that it was just the timeout getting in the way.

So my questions now turn to…

  • Is my interpretation correct?

  • What is the cudaThreadSynchronize() doing in this case?

  • Is it problematic to use a cudaThreadSynchronize between every call?

  • Anything else I should worry about?