Possible reasons for CUDA_ERROR_LAUNCH_TIMEOUT Error 702 when calling cuCtxSynchronize


I am new to CUDA and would like to ask what the common reasons are for getting Error 702 (CUDA_ERROR_LAUNCH_TIMEOUT). I am calling many different kernel functions, and sometimes cuLaunchGrid returns CUDA_SUCCESS but every call after it (the first of which is cuCtxSynchronize) returns the error.

I read this thread,
according to which some possible reasons might be too many registers/thread, too much shared memory/block or that the kernel runtime exceeds the time allowed.

According to my cubin file I am using at most 12 registers/thread in one of my kernel functions, and since I am using blocks of 16x16 threads, this should be fine (16*16*12 = 3,072 registers/block < 8192, so there is room for more than one block per multiprocessor), as far as I’ve understood.
Also, according to my cubin file I am using at most 40 bytes of shared memory per block (and 40 bytes < 16 kB). I am not allocating any shared memory in my kernel functions.
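The arithmetic above can be written out as a small host-side check. This is only a sketch: the limits of 8192 registers and 16 kB of shared memory per multiprocessor are the figures quoted in this post, not values queried from the device, and the struct and function names are made up for illustration. Real hardware applies further limits (for example on resident threads per multiprocessor) that are not modelled here.

```cpp
#include <cassert>

// Per-multiprocessor limits as quoted in the post (assumed, not queried).
constexpr int kRegistersPerSM   = 8192;
constexpr int kSharedBytesPerSM = 16 * 1024;

// Hypothetical record of the figures read from the cubin file.
struct KernelUsage {
    int blockDimX;
    int blockDimY;
    int registersPerThread;
    int sharedBytesPerBlock;
};

int threadsPerBlock(const KernelUsage& k) {
    return k.blockDimX * k.blockDimY;
}

int registersPerBlock(const KernelUsage& k) {
    return threadsPerBlock(k) * k.registersPerThread;
}

// Upper bound on resident blocks per multiprocessor, considering only
// registers and shared memory.
int maxBlocksPerSM(const KernelUsage& k) {
    assert(registersPerBlock(k) > 0);
    int byRegisters = kRegistersPerSM / registersPerBlock(k);
    if (k.sharedBytesPerBlock <= 0) {
        return byRegisters;  // shared memory does not constrain this kernel
    }
    int byShared = kSharedBytesPerSM / k.sharedBytesPerBlock;
    return byRegisters < byShared ? byRegisters : byShared;
}
```

With the numbers from this post (16x16 threads, 12 registers/thread, 40 bytes of shared memory/block), registersPerBlock gives 3072 and maxBlocksPerSM gives 2, so neither resource looks like the cause of a launch failure.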

I have some really simple kernels, like

extern "C"
__global__ void computeRho(float* d_result, float* d_R, float* d_Z, int width, int height)
{
    // x selects the row, y the column; assumes the grid exactly covers the array.
    int index = (blockIdx.x * blockDim.x + threadIdx.x) * width + (blockIdx.y * blockDim.y + threadIdx.y);

    d_result[index] = d_R[index] * d_Z[index];
}


I’ve measured the time needed for the kernel call like this:
unsigned int timer = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer));
CUT_SAFE_CALL( cutStartTimer( timer));

CUresult launchLfunc = cuLaunchGrid( computeL, height / block_size, width / block_size );
printf("Launch L function: %d\n", launchLfunc);

CUresult synch = cuCtxSynchronize();
printf("Managed to synch %d\n", synch);

CUT_SAFE_CALL( cutStopTimer( timer));
printf("Processing time computeL: %f (ms)\n", cutGetTimerValue( timer));
CUT_SAFE_CALL( cutDeleteTimer( timer));

and according to the results, it takes at most 1.4 ms to execute the most demanding kernel, which is far less than 5 seconds (the time allowed by the watchdog mechanism in Windows XP?). These figures apply to runs where the timeout error does not occur (I just added the timer and haven’t got the error yet).

I would really appreciate any ideas on what might cause the time out error.

Thanks in advance,

I got the timeout error again, and apparently one of the kernel functions sometimes takes 10870.394531 ms to execute, more than 10 seconds!

It seems like these timeout errors appear every time I restart my computer and then run the program. After I’ve run it a couple of times it usually works fine.

The function that happened to produce the error this time (all of them have caused it sometime) looks as follows:

extern "C"
__global__ void computeZRunTime(float* d_Z, float* d_R, int width, int height)
{
    // Ltex is a texture reference declared elsewhere in the file (not shown).
    float4 L = tex2D(Ltex, blockIdx.y * blockDim.y + threadIdx.y, blockIdx.x * blockDim.x + threadIdx.x);
    float diag = -(-L.x-L.y-L.z-L.w);
    float prec = 1.0f;

    if( diag > 0.0001f)
        prec = 1.0f / diag;

    d_Z[(blockIdx.x * blockDim.x + threadIdx.x)*width + (blockIdx.y * blockDim.y + threadIdx.y)] =
        d_R[(blockIdx.x * blockDim.x + threadIdx.x)*width + (blockIdx.y * blockDim.y + threadIdx.y)] * prec;
}


Could the error be due to divergence caused by the if clause? I have got the same time out error with kernel functions without if clauses though…

I am using blocks of 16x16 threads on a grid of 16x16 blocks. According to the cubin file this particular kernel function uses 12 registers/thread and 32 bytes shared memory/block.

I got another timeout error; it takes 13279.294922 ms to execute the following function:

extern "C"
__global__ void computeRh1(float* d_result, float* d_R, float* d_Z, int width, int height)
{
    int index = (blockIdx.x * blockDim.x + threadIdx.x) * width + (blockIdx.y * blockDim.y + threadIdx.y);

    d_result[index] = d_R[index] * d_Z[index];
}


I am using blocks of 16x16 threads on a grid of 16x16 blocks. This kernel function uses 4 registers/thread and 36 bytes shared memory/block.

You already linked to the post where I describe my bug: have you tried to test your kernels in the same way? Just take 1 kernel that you know causes the problem and call it over and over again 100,000 times, checking for the error each time. In my kernels, I typically see a normally 5ms kernel take 5s and hit the timeout error after only 10,000-20,000 calls.
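A repeated-launch test along those lines could be sketched on the host as below. It is only an illustration: launchAndSync is a hypothetical callable that you would write yourself to wrap cuLaunchGrid followed by cuCtxSynchronize and return the resulting status code (0 is CUDA_SUCCESS, 702 is CUDA_ERROR_LAUNCH_TIMEOUT). StressResult and stressKernel are likewise made-up names, and the timing uses the host clock rather than the cutil timers.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Summary of a stress run.
struct StressResult {
    int iterationsRun;  // launches attempted, including a failing one
    int firstError;     // first nonzero status seen, or 0 if none
    double maxMillis;   // slowest launch + synchronize, in milliseconds
};

// Calls launchAndSync up to `iterations` times and stops at the first
// nonzero status. launchAndSync is a hypothetical wrapper supplied by
// the caller; it must not return until the kernel has finished.
StressResult stressKernel(const std::function<int()>& launchAndSync, int iterations) {
    assert(iterations >= 0);
    StressResult result{0, 0, 0.0};
    for (int i = 0; i < iterations; ++i) {
        auto start = std::chrono::steady_clock::now();
        int status = launchAndSync();
        auto stop = std::chrono::steady_clock::now();

        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        if (ms > result.maxMillis) {
            result.maxMillis = ms;
        }
        result.iterationsRun = i + 1;
        if (status != 0) {
            result.firstError = status;
            break;
        }
    }
    return result;
}
```

Running this with 100,000 iterations on a single suspect kernel and printing iterationsRun, firstError and maxMillis shows both how soon the error appears and whether launch times creep up before it does.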

Your issue may be related, and yet it may be different. I can only replicate the issue in relatively complicated kernels while you seem to have it show up in the simplest ones.

I have been able to “work around” this issue by increasing block sizes and removing the use of textures (which makes my kernels run at less than 1/2 speed :( ). NVIDIA is working on the bug, but I’ve had no updates from them in a while… hopefully it will be fixed in 1.2, whenever that comes out.

On the other hand, maybe I’m jumping to conclusions by assuming that you are having the same issue as me. I have triggered similar issues by accidentally writing past the end of an array on the GPU. Have you allocated your arrays with enough padding so that “d_result[index] = d_R[index] * d_Z[index];” won’t write past the end of the array when the array size is not a multiple of the block size?
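One way to check this without touching the GPU is to evaluate the kernels’ index expression on the host for the last thread of the last block. The sketch below assumes the indexing used by the kernels posted above, (blockIdx.x * blockDim.x + threadIdx.x) * width + (blockIdx.y * blockDim.y + threadIdx.y); the function names are hypothetical.

```cpp
#include <cassert>

// Number of blocks needed to cover n elements with blocks of blockDim
// threads, rounding up.
int gridDimFor(int n, int blockDim) {
    assert(n > 0 && blockDim > 0);
    return (n + blockDim - 1) / blockDim;
}

// Largest linear index any thread computes, using the expression from
// the kernels in this thread.
long long maxLinearIndex(int gridX, int gridY, int blockX, int blockY, int width) {
    long long lastRow = static_cast<long long>(gridX) * blockX - 1;
    long long lastCol = static_cast<long long>(gridY) * blockY - 1;
    return lastRow * width + lastCol;
}

// True if no thread indexes past the end of a width x height array.
bool launchStaysInBounds(int gridX, int gridY, int blockX, int blockY,
                         int width, int height) {
    long long elements = static_cast<long long>(width) * height;
    return maxLinearIndex(gridX, gridY, blockX, blockY, width) < elements;
}
```

For a 256x256 array with 16x16 blocks on a 16x16 grid the largest index is 65535, which is in bounds. For a 250x250 array with the grid rounded up to 16x16, it is not, so the kernel would need an explicit bounds test or the arrays would need padding. A truncating division such as height / block_size stays in bounds but leaves the remainder of the array unprocessed.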

Thanks for the reply MisterAnderson42!

I have tried running the kernels 100,000 times, and the results vary… sometimes there’s no timeout at all, and sometimes it happens after only 4-6 iterations (!).

All my arrays are of size 256x256 and the block size is 16x16, so it should be ok. However, I am using textures in several of the kernels, so I guess I should try removing them next.

So NVIDIA has confirmed that there is a bug causing your problems?

My code is pretty complex right now so I will try to strip it down and see at which point I stop getting the errors. Thanks a lot for your help!