the launch timed out and was terminated.

Hi all,

I am trying to run the following kernel on a compute capability 1.2 GPU, but I get the following error:

cuda_src/main.cu(198) : cudaSafeCall() Runtime API error : the launch timed out and was terminated.

With the ptxas verbose flag I get the following:

ptxas info : Compiling entry function ‘_Z15ac_searchKernelPPcPiPS1_S1_S1_S1_’ for ‘sm_10’

ptxas info : Used 8 registers, 2048+0 bytes lmem, 2072+16 bytes smem, 16 bytes cmem[1]

cuda_src/main.cu(26): warning: variable “d_ac_results” was declared but never referenced

As far as I can see there is no problem with the registers, as I use 2 blocks of 256 threads each.

Can anyone help me find the cause of this error?

[codebox]#include <stdio.h>

#ifndef MAX_RESULTS
#define MAX_RESULTS 320
#endif

#ifndef NUMOFPACKETS
#define NUMOFPACKETS 512
#endif

__global__ void ac_searchKernel(char **g_packets, int *g_packetLen, int **g_ac_goto, int *g_ac_fail, int *g_ac_output, int *g_ac_results)
{
	int i, tid, state[NUMOFPACKETS];
	__shared__ int numOfResults[NUMOFPACKETS];

	tid = blockIdx.x * blockDim.x + threadIdx.x;
	state[tid] = 0;
	numOfResults[tid] = 0;
	__syncthreads();

	for (i = 0; i < *g_packetLen; i++) {
		while (g_ac_goto[state[tid]][g_packets[tid][i]] == FAIL)
			state[tid] = g_ac_fail[state[tid]];
		state[tid] = g_ac_goto[state[tid]][g_packets[tid][i]];
		if (g_ac_output[state[tid]] == 1) {
			if (numOfResults[tid] != MAX_RESULTS) {
			//	g_ac_results[tid][numOfResults[tid]] = i;
			//	g_ac_results[tid][numOfResults[tid] + 1] = state[tid];
				numOfResults[tid] = numOfResults[tid] + 1;
			}
		}
	}
	__syncthreads();

	g_ac_results[tid] = numOfResults[tid];
}[/codebox]

The error means that the kernel is taking more than 5 seconds to run, so a driver protection mechanism (the watchdog timer) is killing the kernel. This happens on GPUs that are also driving a display. If you don’t run X11 on the card, you will probably find it works normally. If you need to use the card for display, your only choices are either to make the code faster, or to reduce the amount of work the kernel does per launch so each launch finishes in less than 5 seconds.
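If you take the second route, the usual pattern is to split the work across several short launches, keeping each thread’s automaton state in global memory so it survives between launches. A rough sketch of the idea (CHUNK, searchChunk and d_state are made-up names for illustration, not taken from your code):

[codebox]// Sketch only: process each packet in fixed-size chunks, one launch per
// chunk, so that no single launch can hit the watchdog limit.
#define CHUNK 4096

__global__ void searchChunk(char **g_packets, int offset, int len, int *d_state)
{
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	int state = d_state[tid];	// resume the automaton where the last launch stopped
	for (int i = offset; i < offset + len; i++) {
		// ... advance the automaton over g_packets[tid][i] ...
	}
	d_state[tid] = state;		// save it so the next launch can continue
}

void searchAllChunks(char **d_packets, int packetLen, int *d_state)
{
	for (int offset = 0; offset < packetLen; offset += CHUNK) {
		int len = (packetLen - offset < CHUNK) ? (packetLen - offset) : CHUNK;
		searchChunk<<<2, 256>>>(d_packets, offset, len, d_state);
		cudaThreadSynchronize();	// each launch finishes well inside the limit
	}
}[/codebox]

Each launch only scans CHUNK bytes per packet, so you can tune CHUNK until a single launch stays comfortably below the watchdog threshold.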

I tried to reduce the execution time of the kernel by making the following change:

[codebox]__global__ void ac_searchKernel(char **g_packets, int *g_packetLen, int **g_ac_goto, int *g_ac_fail, int *g_ac_output, int *g_ac_results)
{
	int i, tid, state[NUMOFPACKETS];
	__shared__ int numOfResults[NUMOFPACKETS];

	tid = blockIdx.x * blockDim.x + threadIdx.x;
	state[tid] = 0;
	numOfResults[tid] = 0;

	for (i = 0; i < *g_packetLen; i++) {
		while (g_ac_goto[state[tid]][g_packets[tid][i]] == FAIL)
			state[tid] = g_ac_fail[state[tid]];
		state[tid] = g_ac_goto[state[tid]][g_packets[tid][i]];
		if (g_ac_output[state[tid]] == 1) {
			if (numOfResults[tid] != MAX_RESULTS) {
			//	g_ac_results[tid][numOfResults[tid]] = i;
			//	g_ac_results[tid][numOfResults[tid] + 1] = state[tid];
				numOfResults[tid] = numOfResults[tid] + 1;
			}
		}
	}
	__syncthreads();

	g_ac_results[0] = numOfResults[0];
	//g_ac_results[1] = numOfResults[1];
	//g_ac_results[2] = numOfResults[2];
	//g_ac_results[3] = numOfResults[3];
}[/codebox]

but even now I get the same error. However, if instead of numOfResults[0] I write a hardcoded int value like 14, there is no problem. I cannot figure out why the use of numOfResults causes so much execution time and increases the register usage from 2 to 9.

I also noticed that the kernel runs fine with 1 block of 32 threads or 2 blocks of 16 threads each, but if I increase the total number of threads, e.g. to 1 block of 64 threads, I have the same problem again.

So, with one GPU, the only option I have is to make each kernel execution faster?
This sounds crazy to me.

I’m working on a Runge-Kutta implementation using CUDA, and each execution of my kernels might take longer than 5 s. Does no one know a way to solve this without using 2 GPUs or without running X11?

There are four options:

  1. Use a second display card. It doesn’t have to be a CUDA GPU; the cheapest $30 card will work.

  2. Don’t run X11 on the GPU machine. This is pretty practical, especially if you use a different PC to ssh into it and the machines share filesystems.

  3. Make your kernels run in less than 5 seconds. This is still practical and efficient; kernel launch overhead is really very small, measured in microseconds, so splitting the work across many launches costs almost nothing (see the timing sketch after this list).

  4. Run Windows XP, which lets you disable the watchdog timer with registry hacks. (Your PC display still freezes while a kernel runs, but it works.)
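If you go with option 3, you can check how close each launch gets to the limit by timing it with CUDA events. A minimal sketch (myKernel and its launch configuration are placeholders; this fragment goes inside host code and assumes <stdio.h> is included):

[codebox]// Time one kernel launch with CUDA events to verify it stays well
// under the ~5 second watchdog limit.
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<2, 256>>>(/* args */);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);		// wait for the kernel to finish

cudaEventElapsedTime(&ms, start, stop);
printf("kernel took %.3f ms\n", ms);	// needs to stay well below 5000 ms

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]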
