Timeout - restart kernel

Dear all,

I have the following problem:
If I execute a kernel that exceeds its allowed run time, I get "the launch timed out and was terminated".
The code looks like this:

kernel<<<nr_blocks, nr_threads>>>(p_gpu, y);
cerr = cudaGetLastError();
if (cudaSuccess != cerr) {
    fprintf(stderr, "ERROR: kernel1 - %s !\n", cudaGetErrorString(cerr));
}

My problem now is that I can't do any memory transfer or execute any further kernel
after the failing kernel… What can I do?

best regards,

I think the problem here is that runtime API contexts are shut down when certain classes of errors occur (as in this case). If you want to continue, I think you will need to open a new context and start from the beginning, because the contents of GPU memory, etc., will be lost to your process.
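In the runtime API of that era, "opening a new context" amounts to tearing the old one down and letting the next runtime call create a fresh one. A hedged sketch (assuming pre-4.0 toolkits, where the call is cudaThreadExit(); later toolkits renamed it cudaDeviceReset(). p_host and nbytes are placeholder names, not code from the thread):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: after a watchdog abort, destroy the broken context and
// rebuild all device state from host-side copies of the data.
void recover_after_timeout(unsigned int *p_host, size_t nbytes)
{
    cudaError_t cerr = cudaGetLastError();
    if (cerr != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(cerr));
        cudaThreadExit();  // tear down the dead context (cudaDeviceReset() on newer toolkits)

        unsigned int *p_gpu = NULL;
        cudaMalloc((void **)&p_gpu, nbytes);  // first runtime call creates a fresh context
        cudaMemcpy(p_gpu, p_host, nbytes, cudaMemcpyHostToDevice);
        // ...re-launch everything from the beginning...
    }
}
```

Note this only helps if you still have host copies of everything; whatever lived solely in device memory is gone with the old context.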

It might be easier to focus on working out some metrics which can predict when a kernel might hit the watchdog timer limit and adjust the kernel execution parameters accordingly so that the context never gets closed in the first place. Alternatively, run on a GPU without an active display, if you have that option available to you.
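You can actually detect at startup whether a given GPU has the watchdog active: cudaDeviceProp has a kernelExecTimeoutEnabled field. A small sketch that prefers a device without the run-time limit (device selection policy is my assumption):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: enumerate devices and pick the first one whose kernels are
// not subject to the display watchdog, if any such device exists.
int pick_watchdog_free_device(void)
{
    int count = 0, chosen = -1;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d (%s): watchdog %s\n", dev, prop.name,
               prop.kernelExecTimeoutEnabled ? "active" : "inactive");
        if (!prop.kernelExecTimeoutEnabled && chosen < 0)
            chosen = dev;  // remember the first display-free GPU
    }
    if (chosen >= 0)
        cudaSetDevice(chosen);  // run the long kernels there
    return chosen;  // -1 means every device has the watchdog
}
```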

Thank you for your answer. I could find a metric for not hitting the timeout. However, different cards have a different number of streaming multiprocessors.
E.g. a MacBook (Pro) has one or two; my Quadro and Tesla cards have 30. If I make the kernel so small that it fits e.g. a MacBook, I add a lot of latency for
my Tesla card (since it can execute 30 blocks in the same time). I wanted to add a byte field to the kernel that indicates which blocks have finished
their calculation, and then start over again… However, you might be right that I need a new context… but this would imply that the array (for the blocks)
and the already-processed data are gone as well :(

see code:

__global__ void kernel(unsigned int *c, unsigned int *yy) {
    int bid = blockIdx.x;
    int tid = threadIdx.x;

    __shared__ int process;
    __shared__ unsigned int loop;

    if (tid == 0) {
        process = 1;

        if (c[bid]) {
            //fprintf(stderr, " P: %d\n", bid);  // block already processed
            process = 0;
        } else {
            //fprintf(stderr, "NP: %d\n", bid);  // block not processed yet
        }
    }
    __syncthreads();  // make the flag visible to every thread in the block

    if (process) {
        // do something

        if (tid == 0) {
            c[bid] = 1;  // mark this block as finished
        }
    }
}

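The host side of that restart idea might look like the sketch below (hedged: c_gpu, c_host and the relaunch policy are my assumptions; nr_blocks, nr_threads and y come from the thread; cudaThreadSynchronize() is the pre-4.0 runtime call). As discussed above, a watchdog abort kills the context, so in practice the copy after a timed-out launch will fail — which is exactly the problem:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Sketch: keep a per-block completion array c on the device and
// relaunch the kernel until every block has marked itself done.
void run_until_done(unsigned int *y, int nr_blocks, int nr_threads)
{
    unsigned int *c_gpu  = NULL;
    unsigned int *c_host =
        (unsigned int *)calloc(nr_blocks, sizeof(unsigned int));
    cudaMalloc((void **)&c_gpu, nr_blocks * sizeof(unsigned int));
    cudaMemset(c_gpu, 0, nr_blocks * sizeof(unsigned int));

    int all_done = 0;
    while (!all_done) {
        kernel<<<nr_blocks, nr_threads>>>(c_gpu, y);
        cudaError_t cerr = cudaThreadSynchronize();  // wait for the launch
        if (cerr != cudaSuccess && cerr != cudaErrorLaunchTimeout)
            break;  // some other failure: give up

        // copy the completion flags back and check for unfinished blocks
        cudaMemcpy(c_host, c_gpu, nr_blocks * sizeof(unsigned int),
                   cudaMemcpyDeviceToHost);
        all_done = 1;
        for (int i = 0; i < nr_blocks; ++i)
            if (!c_host[i]) { all_done = 0; break; }
    }
    free(c_host);
    cudaFree(c_gpu);
}
```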
Another approach would be to see how amenable your algorithm is to processing subsets of the data, so that you launch many kernels to cover the total input data space (this has the advantage of laying a lot of the groundwork for multi-GPU work, if you want to do that later), or to break the algorithm into stages with a separate kernel for each stage, again so that each kernel stage stays under the timeout limit. Kernel launches are pretty cheap (certainly cheaper than establishing or switching contexts and/or throwing away partial results when kernels time out).
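That chunked approach might be sketched like this (hypothetical: kernel_chunk, CHUNK_BLOCKS, the offset parameter and run_all are placeholders I've invented to illustrate the idea, not code from the thread):

```cuda
#include <cuda_runtime.h>

// Size each launch so it stays well under the watchdog limit.
#define CHUNK_BLOCKS 1024

// Sketch: each launch covers CHUNK_BLOCKS blocks of the total data
// space; offset maps blockIdx.x back to a global block index.
__global__ void kernel_chunk(unsigned int *p, unsigned int *yy,
                             unsigned int offset)
{
    unsigned int bid = offset + blockIdx.x;  // global block index
    // ...process block bid of the total input data space...
}

void run_all(unsigned int *p_gpu, unsigned int *y,
             unsigned int total_blocks, int nr_threads)
{
    for (unsigned int off = 0; off < total_blocks; off += CHUNK_BLOCKS) {
        unsigned int n = total_blocks - off;
        if (n > CHUNK_BLOCKS)
            n = CHUNK_BLOCKS;  // last chunk may be smaller
        kernel_chunk<<<n, nr_threads>>>(p_gpu, y, off);
        cudaThreadSynchronize();  // finish each chunk before the next
    }
}
```

The same offset parameter is also the natural seam for splitting the loop across multiple GPUs later.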

Whichever way you go, it would seem to make more sense to develop a strategy that avoids timeouts wherever possible, rather than to code recovery strategies for when they happen. The runtime API isn't really set up for that sort of fault tolerance.