the launch timed out and was terminated.

Hi all,

I am trying to run the following kernel on a compute capability 1.2 GPU, but I get the following error:

cuda_src/main.cu(198) : cudaSafeCall() Runtime API error : the launch timed out and was terminated.

With the ptxas verbose flag I get the following:

ptxas info : Compiling entry function ‘_Z15ac_searchKernelPPcPiPS1_S1_S1_S1_’ for ‘sm_10’

ptxas info : Used 8 registers, 2048+0 bytes lmem, 2072+16 bytes smem, 16 bytes cmem[1]

cuda_src/main.cu(26): warning: variable “d_ac_results” was declared but never referenced

As far as I can see there is no problem with the registers, as I use 2 blocks of 256 threads each.

Can anyone help me find the cause of this error?

[codebox]#include <stdio.h>

#ifndef MAX_RESULTS
#define MAX_RESULTS 320
#endif

#ifndef NUMOFPACKETS
#define NUMOFPACKETS 512
#endif

__global__ void ac_searchKernel(char **g_packets, int *g_packetLen, int **g_ac_goto, int *g_ac_fail, int *g_ac_output, int *g_ac_results)
{
	int i, tid, state[NUMOFPACKETS];
	__shared__ int numOfResults[NUMOFPACKETS];

	tid = blockIdx.x * blockDim.x + threadIdx.x;
	state[tid] = 0;
	numOfResults[tid] = 0;
	__syncthreads();

	for (i = 0; i < *g_packetLen; i++) {
		while (g_ac_goto[state[tid]][g_packets[tid][i]] == FAIL)
			state[tid] = g_ac_fail[state[tid]];
		state[tid] = g_ac_goto[state[tid]][g_packets[tid][i]];
		if (g_ac_output[state[tid]] == 1) {
			if (numOfResults[tid] != MAX_RESULTS) {
			//	g_ac_results[tid][numOfResults[tid]] = i;
			//	g_ac_results[tid][numOfResults[tid] + 1] = state[tid];
				numOfResults[tid] = numOfResults[tid] + 1;
			}
		}
	}
	__syncthreads();

	g_ac_results[tid] = numOfResults[tid];
}[/codebox]

The error means that the kernel is taking more than 5 seconds to run, so a driver protection mechanism (the watchdog timer) is killing the kernel. This happens on GPUs that are also driving a display. If you don’t run X11 on the card, you will probably find it works normally. If you need to use the card for display, your only choices are either to make the code faster, or to reduce the amount of work the kernel does per launch so each launch finishes in less than 5 seconds.
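If you take the second route, the usual pattern is to split the work across several short launches, keeping each thread’s automaton state in global memory so it survives between launches. A rough sketch of the idea (CHUNK, searchChunk and d_state are made-up names for illustration, not taken from your code):

[codebox]// Sketch only: process each packet in fixed-size chunks, one launch per
// chunk, so that no single launch can hit the watchdog limit.
#define CHUNK 4096

__global__ void searchChunk(char **g_packets, int offset, int len, int *d_state)
{
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	int state = d_state[tid];	// resume the automaton where the last launch stopped
	for (int i = offset; i < offset + len; i++) {
		// ... advance the automaton over g_packets[tid][i] ...
	}
	d_state[tid] = state;		// save it so the next launch can continue
}

void searchAllChunks(char **d_packets, int packetLen, int *d_state)
{
	for (int offset = 0; offset < packetLen; offset += CHUNK) {
		int len = (packetLen - offset < CHUNK) ? (packetLen - offset) : CHUNK;
		searchChunk<<<2, 256>>>(d_packets, offset, len, d_state);
		cudaThreadSynchronize();	// each launch finishes well inside the limit
	}
}[/codebox]

Each launch only scans CHUNK bytes per packet, so you can tune CHUNK until a single launch stays comfortably below the watchdog threshold.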

I tried to reduce the execution time of the kernel by making the following change:

[codebox]__global__ void ac_searchKernel(char **g_packets, int *g_packetLen, int **g_ac_goto, int *g_ac_fail, int *g_ac_output, int *g_ac_results)
{
	int i, tid, state[NUMOFPACKETS];
	__shared__ int numOfResults[NUMOFPACKETS];

	tid = blockIdx.x * blockDim.x + threadIdx.x;
	state[tid] = 0;
	numOfResults[tid] = 0;

	for (i = 0; i < *g_packetLen; i++) {
		while (g_ac_goto[state[tid]][g_packets[tid][i]] == FAIL)
			state[tid] = g_ac_fail[state[tid]];
		state[tid] = g_ac_goto[state[tid]][g_packets[tid][i]];
		if (g_ac_output[state[tid]] == 1) {
			if (numOfResults[tid] != MAX_RESULTS) {
			//	g_ac_results[tid][numOfResults[tid]] = i;
			//	g_ac_results[tid][numOfResults[tid] + 1] = state[tid];
				numOfResults[tid] = numOfResults[tid] + 1;
			}
		}
	}
	__syncthreads();

	g_ac_results[0] = numOfResults[0];
	//g_ac_results[1] = numOfResults[1];
	//g_ac_results[2] = numOfResults[2];
	//g_ac_results[3] = numOfResults[3];
}[/codebox]

but even now I get the same error. However, if instead of numOfResults[0] I write a hardcoded int value like 14, there is no problem. I cannot figure out why the use of numOfResults causes so much execution time and increases the register usage from 2 to 9.

I also noticed that the kernel runs fine with 1 block of 32 threads or 2 blocks of 16 threads each, but if I increase the total number of threads, e.g. to 1 block of 64 threads, I have the same problem again.

So, with one GPU, the only option I have is to make each kernel execution faster?
This sounds crazy to me.

I’m working on a Runge-Kutta implementation using CUDA, and each execution of my kernels might take longer than 5 s. Does no one know a way to solve this without using 2 GPUs or without running X11?

There are four options:

  1. Use a second display card. It doesn’t have to be a CUDA GPU; the cheapest $30 card will work.

  2. Don’t run X11 on the GPU machine. This is pretty practical, especially if you use a different PC to ssh into it and the machines share filesystems.

  3. Make your kernels run in less than 5 seconds. This is still practical and efficient; kernel launch overhead is really very small, measured in microseconds, so splitting the work across many launches costs almost nothing (see the timing sketch after this list).

  4. Run Windows XP, which lets you disable the watchdog timer with registry hacks. (Your PC display still freezes while a kernel runs, but it works.)
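If you go with option 3, you can check how close each launch gets to the limit by timing it with CUDA events. A minimal sketch (myKernel and its launch configuration are placeholders; this fragment goes inside host code and assumes <stdio.h> is included):

[codebox]// Time one kernel launch with CUDA events to verify it stays well
// under the ~5 second watchdog limit.
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<2, 256>>>(/* args */);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);		// wait for the kernel to finish

cudaEventElapsedTime(&ms, start, stop);
printf("kernel took %.3f ms\n", ms);	// needs to stay well below 5000 ms

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]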
