Timeout - restart kernel

Dear all,

I have the following problem:
If I execute a kernel that exceeds its allowed run time, I get "the launch timed out and was terminated".
The code looks like this:

kernel<<<nr_blocks, nr_threads>>>(p_gpu, y);
cerr = cudaGetLastError();
if (cudaSuccess != cerr) {
    fprintf(stderr, "ERROR: kernel1 - %s !\n", cudaGetErrorString(cerr));
}

My problem now is that I can't do any memory transfer or execute any further kernel
after the failing kernel… What can I do?

best regards,

I think the problem here is that runtime API contexts are shut down when certain classes of errors occur (as in this case). If you want to continue, I think you will need to open a new context and start from the beginning, because the contents of GPU memory, etc., will be lost to your process.
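In the runtime API of that era, "opening a new context" amounts to tearing the old one down and letting the next runtime call create a fresh one. A hedged sketch (assuming pre-4.0 toolkits, where the call is cudaThreadExit(); later toolkits renamed it cudaDeviceReset(). p_host and nbytes are placeholder names, not code from the thread):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: after a watchdog abort, destroy the broken context and
// rebuild all device state from host-side copies of the data.
void recover_after_timeout(unsigned int *p_host, size_t nbytes)
{
    cudaError_t cerr = cudaGetLastError();
    if (cerr != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(cerr));
        cudaThreadExit();  // tear down the dead context (cudaDeviceReset() on newer toolkits)

        unsigned int *p_gpu = NULL;
        cudaMalloc((void **)&p_gpu, nbytes);  // first runtime call creates a fresh context
        cudaMemcpy(p_gpu, p_host, nbytes, cudaMemcpyHostToDevice);
        // ...re-launch everything from the beginning...
    }
}
```

Note this only helps if you still have host copies of everything; whatever lived solely in device memory is gone with the old context.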

It might be easier to focus on working out some metrics which can predict when a kernel might hit the watchdog timer limit and adjust the kernel execution parameters accordingly so that the context never gets closed in the first place. Alternatively, run on a GPU without an active display, if you have that option available to you.
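You can actually detect at startup whether a given GPU has the watchdog active: cudaDeviceProp has a kernelExecTimeoutEnabled field. A small sketch that prefers a device without the run-time limit (device selection policy is my assumption):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: enumerate devices and pick the first one whose kernels are
// not subject to the display watchdog, if any such device exists.
int pick_watchdog_free_device(void)
{
    int count = 0, chosen = -1;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d (%s): watchdog %s\n", dev, prop.name,
               prop.kernelExecTimeoutEnabled ? "active" : "inactive");
        if (!prop.kernelExecTimeoutEnabled && chosen < 0)
            chosen = dev;  // remember the first display-free GPU
    }
    if (chosen >= 0)
        cudaSetDevice(chosen);  // run the long kernels there
    return chosen;  // -1 means every device has the watchdog
}
```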

Thank you for your answer. I could find a metric for not hitting the timeout. However, different cards have a different number of streaming multiprocessors.
E.g. a MacBook (Pro) has one or two; my Quadro and Tesla cards have 30. If I make the kernel so small that it fits e.g. a MacBook, I add a lot of latency for
my Tesla card (since it can execute 30 blocks in the same time). I wanted to add a byte field to the kernel that indicates which blocks have finished
their calculation, and then start over again… However, you might be right that I need a new context… but this would imply that the array (for the blocks)
and the already-processed data are gone as well :(

see code:

__global__ void kernel(unsigned int *c, unsigned int *yy) {
    int bid = blockIdx.x;
    int tid = threadIdx.x;

    __shared__ int process;
    __shared__ unsigned int loop;

    if (tid == 0) {
        process = 1;

        if (c[bid]) {
            //fprintf(stderr, " P: %d\n", bid);  // block already processed
            process = 0;
        } else {
            //fprintf(stderr, "NP: %d\n", bid);  // block not processed yet
        }
    }
    __syncthreads();  // make the flag visible to every thread in the block

    if (process) {
        // do something

        if (tid == 0) {
            c[bid] = 1;  // mark this block as finished
        }
    }
}

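The host side of that restart idea might look like the sketch below (hedged: c_gpu, c_host and the relaunch policy are my assumptions; nr_blocks, nr_threads and y come from the thread; cudaThreadSynchronize() is the pre-4.0 runtime call). As discussed above, a watchdog abort kills the context, so in practice the copy after a timed-out launch will fail — which is exactly the problem:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Sketch: keep a per-block completion array c on the device and
// relaunch the kernel until every block has marked itself done.
void run_until_done(unsigned int *y, int nr_blocks, int nr_threads)
{
    unsigned int *c_gpu  = NULL;
    unsigned int *c_host =
        (unsigned int *)calloc(nr_blocks, sizeof(unsigned int));
    cudaMalloc((void **)&c_gpu, nr_blocks * sizeof(unsigned int));
    cudaMemset(c_gpu, 0, nr_blocks * sizeof(unsigned int));

    int all_done = 0;
    while (!all_done) {
        kernel<<<nr_blocks, nr_threads>>>(c_gpu, y);
        cudaError_t cerr = cudaThreadSynchronize();  // wait for the launch
        if (cerr != cudaSuccess && cerr != cudaErrorLaunchTimeout)
            break;  // some other failure: give up

        // copy the completion flags back and check for unfinished blocks
        cudaMemcpy(c_host, c_gpu, nr_blocks * sizeof(unsigned int),
                   cudaMemcpyDeviceToHost);
        all_done = 1;
        for (int i = 0; i < nr_blocks; ++i)
            if (!c_host[i]) { all_done = 0; break; }
    }
    free(c_host);
    cudaFree(c_gpu);
}
```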
Another approach would be to see how amenable your algorithm is to processing subsets of the data, so that you launch many kernels to cover the total input data space (this has the advantage of laying a lot of the groundwork for multi-GPU work, if you want to do that later), or to break the algorithm into stages with a separate kernel for each stage, again so that each kernel stage stays under the timeout limit. Kernel launches are pretty cheap (certainly cheaper than establishing or switching contexts and/or throwing away partial results when kernels time out).
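That chunked approach might be sketched like this (hypothetical: kernel_chunk, CHUNK_BLOCKS, the offset parameter and run_all are placeholders I've invented to illustrate the idea, not code from the thread):

```cuda
#include <cuda_runtime.h>

// Size each launch so it stays well under the watchdog limit.
#define CHUNK_BLOCKS 1024

// Sketch: each launch covers CHUNK_BLOCKS blocks of the total data
// space; offset maps blockIdx.x back to a global block index.
__global__ void kernel_chunk(unsigned int *p, unsigned int *yy,
                             unsigned int offset)
{
    unsigned int bid = offset + blockIdx.x;  // global block index
    // ...process block bid of the total input data space...
}

void run_all(unsigned int *p_gpu, unsigned int *y,
             unsigned int total_blocks, int nr_threads)
{
    for (unsigned int off = 0; off < total_blocks; off += CHUNK_BLOCKS) {
        unsigned int n = total_blocks - off;
        if (n > CHUNK_BLOCKS)
            n = CHUNK_BLOCKS;  // last chunk may be smaller
        kernel_chunk<<<n, nr_threads>>>(p_gpu, y, off);
        cudaThreadSynchronize();  // finish each chunk before the next
    }
}
```

The same offset parameter is also the natural seam for splitting the loop across multiple GPUs later.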

Whichever way you go, it would seem to make more sense to develop a strategy that avoids timeouts wherever possible, rather than to code recovery strategies for when they happen. The runtime API isn't really set up for that sort of fault tolerance.