I’m finding it pretty debilitating for our desktop supercomputers to be limited to a mere 5 seconds of kernel execution time. I’m used to hours, days or even weeks of processing on certain kinds of scientific/engineering problems. Has anybody managed to work around this rather severe limitation?
I’ve been reading on about half a dozen forums that the limitation boils down to some trivial Windows display driver timeout. The semi-solution of buying a second GPU for dedicated processing is hardly a solution for most people - not least because most systems (including mine) have but one PCIe x16 slot. I also find it very wasteful to have half the resources sitting there doing nothing except painting windows. Linux is not an option either: some of the software I need to hook this up to only runs on Windows.
And it’s a real shame that I can’t at least have that tiny on-board ATI GPU drive the display while dedicating my new NVIDIA GPU to the stuff that matters!
I also read that Windows XP SP1 has a means of altering this time-out by setting a large BreakPointDelay value in a registry key.
However, this does not seem to have any effect on my WinXP SP3 installation anyway.
Any ideas? Has nobody tried hacking and butchering the Windows GDI yet? There’s got to be a way around this pesky timeout!
I also tried breaking up the kernel code into multiple shorter pieces, but this is not working either, since it appears that the TOTAL execution time is what matters in the end. A series of 20 separate 1-second kernel calls (with different parameters) is merely being treated as if it were one 20-second call :-/
I spent all day hacking at this issue and I’m at a loss… and I’d hate to have to give it up so quickly.
If you absolutely can’t find a way to break up your kernels (perhaps try some different things, and ask around here on the forums if you want some feedback), the best solution is really to just run Linux. If you run without a window manager on Linux (i.e. from the command line) you can run kernels as long as you like.
You can disable the watchdog timer on Windows if you like, but basically, due to the underlying driver structure (of Windows, not CUDA), after a few seconds the display will freeze until the kernel has finished executing (however long that takes). If you really need to run Windows and you can have a dedicated compute box, perhaps that is a possible solution for you. If you’re planning to distribute your software for commercial purposes (sell it), you should know that Microsoft basically forbids anyone from disabling the watchdog in commercial software.
Most kernels that are able to be accelerated by CUDA are also able to be broken up into smaller bits; think about Folding@Home, Badaboom, and so forth…even though their tasks might take minutes or hours, they have still found ways to break their kernels up into individual bits so as not to run afoul of the watchdog timer. Like I said, if you’re new to CUDA, you should ask around the forum for some advice – sometimes there are ways to optimize GPU code that may not be obvious unless you have had to figure them out before ;)
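Just to illustrate the idea, something like this (a rough sketch with made-up names - chunkKernel, TOTAL_ITERS, CHUNK_ITERS - and assuming the launch configuration and device buffer are already set up): do only a slice of the total iterations per launch, so that each individual launch stays comfortably under 5 seconds.

// Rough sketch of chunking a long-running loop across many short launches.
#define TOTAL_ITERS 3000000 // the total amount of work you actually want done
#define CHUNK_ITERS 10000   // per-launch slice, sized to stay well under 5 seconds

__global__ void chunkKernel(int* data, int n, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    for (int i = 0; i < iters; i++)
        data[idx] += 1; // stand-in for the real per-iteration work
}

// Host side: many short launches instead of one long one.
for (int done = 0; done < TOTAL_ITERS; done += CHUNK_ITERS)
{
    chunkKernel<<<blocks, threads>>>(d_data, n, CHUNK_ITERS);
    cudaThreadSynchronize(); // finish this chunk (and surface any error) before the next
}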
#define REPEATS 300000 // iterations per call (lowered from 3,000,000 to fit the 5 second limit)

__global__ void myKernel(int* INPUTpointer, int* OUTPUTpointer, int N, int Cin)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // Calculate a linear index
    for (int count = 0; count < REPEATS; count++) // Spending a long time here
    {
        if (idx < N) OUTPUTpointer[idx] = INPUTpointer[idx] + Cin;
    }
}
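The host side, in outline, just fires the calls back to back - a simplified sketch (not my exact code; d_in, d_out and h_out are the usual device/host buffers):

// Simplified sketch of the host loop: 20 back-to-back launches, each with a different Cin.
for (int call = 0; call < 20; call++)
{
    myKernel<<<blocks, threads>>>(d_in, d_out, N, call);
    printf("launched call %d\n", call + 1); // prints immediately, since launches are asynchronous
}
cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost); // the failure eventually surfaces at a blocking call like this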
As much as I would love to, unfortunately, Linux is not on my list due to other considerations (other software packages that don’t run on Linux).
Although it’s ugly, I can live with the screen freezing during computation runs. At the moment, getting this thing to work is way more important than distributing my software. So if there IS a way of killing that GDI watchdog, I’d be the first to want to know how to murder it :-P… Never mind the display or Microsoft’s paranoia. So the question remains: has anybody ever managed to do this before?
I agree with you and I’m sure it’s a solved problem, otherwise everybody would have packed up and left the GPGPU business!
However, I’m fairly new, but I’ve now got to the point where I want to stop playing with toy examples and begin doing some serious stuff with it. But I fail to understand how it works for the larger problems, because when I try it, CUDA beats me by lumping together all my 1-second kernel calls…
Have a look at the code example I posted in the other fork of this thread!
@marcnet: AFAIK multiple kernels each running within the 5 sec watchdog should be fine. I note that you are experiencing problems with this, and I will try the code that you have posted and get back to you. But, FYI - my app can run multiple kernels which together run for more than 5 secs.
At quick glance, try putting a cudaThreadSynchronize() or cudaGetLastError() after each kernel call.
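Something along these lines after every launch (a sketch, reusing the names from the kernel you posted):

myKernel<<<blocks, threads>>>(d_in, d_out, N, Cin);
cudaThreadSynchronize(); // block until this particular launch has actually finished
cudaError_t err = cudaGetLastError(); // any error reported now belongs to this launch
if (err != cudaSuccess)
    printf("kernel call failed: %s\n", cudaGetErrorString(err));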
As long as you are running XP, you can use the Tesla cards, which should not have a watchdog limitation since they are not ‘display’ cards. That is my solution to the problem. But note that under Vista, even the Tesla bumps into the watchdog timer, and in fact Vista has a 2 sec watchdog timer. Isn’t that cool?
Hopefully that problem will be fixed under Windows 7. It’s got a newer (obviously) display driver model than Vista, which allows each card in the machine to have its own device. So technically, you could turn off the watchdog for a Tesla and not have it affect your other display cards. Or, you should be able to run an NVIDIA card and an ATI card in the same machine, which would be awesome for testing stuff like OpenCL.
What happens if you lower your “REPEATS” value to something like 1000? Having it set to 300000 is probably way too much work to be doing in one kernel, since you’re reading from global memory each time (which is fairly slow, compared to the other memory available on the device).
The example given requires an effective device bandwidth of at least 32GB/s to have a fighting chance of finishing a single call within 5 seconds. Do you have that?
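(Back-of-the-envelope, assuming - and this is only a guess - something like N = 64K elements: each of the 300,000 loop iterations reads and writes 4 bytes per element, so a single call moves roughly 8 B × 300,000 × 65,536 ≈ 157 GB through global memory, which over 5 seconds works out to about 31-32 GB/s.)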
Otherwise CUDA will just asynchronously queue up as many kernel calls as you’d like - these are the printf statements you see - but still time out on the first call after 5 seconds.
Too much work? What do you mean? The only reason for doing GPGPU computing is to do a lot of work, no?
After all, I’m not reading new data in the simple test case. I’m just reading the same data over and over again. So if it works for one cycle, it should work for any other number, because nothing at all changes from one cycle to the next!
Moreover, all the accesses to global memory are uniform and properly coalesced, and in total I am managing an average of about 5.6 GFLOPS of sustained calculations including transfers (when it works!).
To tell the whole truth, I was hoping to run several trillion (and more) cycles of this perfectly identical (incredibly simple) calculation, never mind 300,000.
In fact I already lowered the REPEATS value from 3,000,000 to 300,000 to fall within the 5 second limit, and then to compensate I increased the number of calls 10-fold.
So in any case, if I lower the REPEATS value further to 1,000, I would have to compensate by increasing the number of calls 300-fold, which is most inefficient and still leaves me no better off, as I have already explained in my previous post.
Launches are asynchronous, so you’re probably timing out on the first kernel you launch. Eventually it fails (long after you enqueue the rest of your kernels), so it only seems like the last kernel is failing.
Yes, I know why we want to use CUDA. I was just asking you to try a lower number so that you could rule out the possibility that the kernel was taking too long (for some reason) and not some other error.
Preface: I know exactly what I’m talking about with regards to watchdog timers because I broke them many times and found out what the hardware and driver are actually doing to implement them.
The watchdog timer only affects you on RM-based OSes (which is to say, not Vista and OS X) if you have a single kernel invocation that takes more than 5s of execution time and there is a display connected*. There is no “cumulative watchdog timer” or however you want to describe it because of how the GPU performs context switching. Once a kernel call has been completed, the GPU is free to respond to the driver which will reset the watchdog timer countdown. As a result, there’s no upper bound on the total amount of time spent in CUDA kernels per process.
All your code snippet demonstrates is that some kernel you launch fails; without a cudaThreadSynchronize() after each, there is no way to determine which one fails. Because launches are asynchronous, they will be enqueued, so there is not really any meaningful penalty to doing 3000 kernel launches** versus 300 versus 30 versus 3***.
* if you boot directly to a console in Linux without ever running X, even the GPU used for console output will not have a watchdog timer
** except for the cost of a kernel launch, which on RM platforms is ~8us. time spent in a kernel will dominate for any meaningful amount of work in a kernel
*** it is possible to fill the queue, which will implicitly cause a cudaThreadSynchronize(), so keep that in mind if you’re trying to launch a bazillion kernels and then do CPU-side work
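To make that last footnote concrete, a sketch (doCpuWork() is just a made-up stand-in for whatever host-side work you want to overlap):

// Launches are enqueued asynchronously, so CPU work can overlap with the GPU...
for (int i = 0; i < 3000; i++)
    myKernel<<<blocks, threads>>>(d_in, d_out, N, i); // each launch returns in ~8 us...
                                                      // ...unless the queue fills, which blocks
                                                      // like an implicit cudaThreadSynchronize()
doCpuWork(); // hypothetical CPU-side work running while the kernels execute
cudaThreadSynchronize(); // finally wait for all the enqueued kernels to finish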
Yes, I understand your point… and I appreciate the importance of trying it out… So I actually tried it, and the short answer is that the problem persists.
One short call works well in isolation. But then it fails when it is stacked up with other calls that collectively exceed 5 seconds.
Many thanks for the detailed expert answer - much appreciated!
I just tried inserting a cudaThreadSynchronize() between each call as you suggested, and voila! It solves the problem beautifully (apparently!)
Each kernel call proceeds one after another and they all finish off gracefully giving correct results.
Only this time, instead of shooting all the printf outputs at once and timing out…
… I get one printf output every 3.5 seconds (as would be expected, given the ~3.5 s of kernel processing per call).
So total processing time = 3.5 × 20 = 70 seconds. (At last!)
But as you see, none of them was actually hanging. They were all valid kernel calls, and it therefore appears to me that it was just the timeout getting in the way.
So my questions now turn to…
Is my interpretation correct?
What is the cudaThreadSynchronize() doing in this case?
Is it problematic to use a cudaThreadSynchronize between every call?