Unspecified launch failure: kernel fails if a loop is too long

Hello,

I’m having an issue where CUDA fails to launch my kernel depending on the length of one of the loops in a nested loop structure. With a few iterations it works fine; however, if the number of iterations is increased, CUDA issues the error message “unspecified launch failure”.

Here is the calling program:

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cutil.h>

// includes, kernels
#include <template_kernel.cu>

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest( int argc, char** argv);

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main( int argc, char** argv)
{
    runTest( argc, argv);
    CUT_EXIT(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void runTest( int argc, char** argv)
{
    CUT_CHECK_DEVICE();

    unsigned int num_threads = 64;
    unsigned int len_DelT = 801;

    // allocate host memory
    float* TimeData = (float*) malloc(sizeof(float) * 16400000);

    // initialize host memory
    for (int i = 0; i < 16400000; i++)
        TimeData[i] = i;

    // allocate device memory
    float* d_TimeData;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_TimeData, sizeof(float) * 16400000));

    // copy host memory to device
    CUDA_SAFE_CALL( cudaMemcpy( d_TimeData, TimeData, sizeof(float) * 16400000, cudaMemcpyHostToDevice));

    // allocate device memory for result
    float* d_tss_r;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_tss_r, sizeof(float) * 16400 * len_DelT));

    // setup execution parameters
    dim3 grid( 1, 1, 1);
    dim3 threads( num_threads, 1, 1);

    // execute the kernel
    testKernel<<< grid, threads, 0 >>>(d_tss_r, d_TimeData, 16400, len_DelT, 1000);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    // allocate memory for the result on the host side
    float* h_tss_r_host = (float*) malloc(sizeof(float) * 16400 * len_DelT);

    // copy result from device to host
    CUDA_SAFE_CALL( cudaMemcpy( h_tss_r_host, d_tss_r, sizeof(float) * 16400 * len_DelT, cudaMemcpyDeviceToHost));

    for (int k = 0; k < 10; k++)
    {
        printf("h_tss_r_host[%d] = %g\n", k, h_tss_r_host[k]);
    }

    // cleanup memory
    free(TimeData);
    CUDA_SAFE_CALL( cudaFree(d_TimeData));
    CUDA_SAFE_CALL( cudaFree(d_tss_r));
    free(h_tss_r_host);
}

And here is the kernel:

__global__ void testKernel( float* d_tss_r, float* d_TimeData, unsigned int pts, int len_DelT, unsigned int scans)
{
    // Plain local variables; the __device__ qualifier is not valid on
    // variables declared inside a kernel body.
    float *d_rpc;
    int r_index, c_index, k;

    // Busy-work loops, kept only to reproduce the failure;
    // they do not contribute to the output.
    for (r_index = 0; r_index < 16400; r_index++)
    {
        for (k = 0; k < len_DelT; k++)
        {
            d_rpc = d_TimeData + r_index;
            for (c_index = 0; c_index < scans; c_index++)
            {
                d_rpc += pts;
            }
        }
    }

    d_tss_r[threadIdx.x] = 1.0f;
}

This is part of a larger program that I have stripped down as much as possible while still reproducing the error. As a result, the program in its current state does not do much, and the nested loops do not contribute to the output. If I run the program with a small value for len_DelT (around 100 or less), everything is fine. However, if len_DelT is larger (such as the 801 in the code above), CUDA fails with the aforementioned message.

Any help is appreciated.

Scott

Hey,

I have the same problem and submitted it to NVIDIA. The problem is caused by the execution time exceeding 5 seconds. See the Windows release notes:

Individual GPU program launches are limited to a run time of less than 5 seconds on the device. Exceeding this time limit usually causes a launch failure reported through the CUDA driver or the CUDA runtime, but in some cases hangs the entire machine, requiring a hard reset. Microsoft Windows has a “watchdog” timer that causes programs using the primary graphics adapter to time out if they run longer than the maximum allowed time. For this reason it is recommended that CUDA is run on a G80 that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.
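That is consistent with your numbers: with len_DelT = 801 the kernel’s loops run roughly 16400 × 801 × 1000 ≈ 1.3 × 10^10 iterations per thread, far more than with len_DelT ≈ 100. If you cannot dedicate a second GPU, one common workaround is to split the work into several shorter launches, since the watchdog limit applies to each launch individually, not to their sum. A minimal sketch along those lines (chunkedKernel, launchInChunks, and the chunk size are illustrative, not part of your code):

// Hypothetical split of testKernel: each launch handles only the rows in
// [row_start, row_start + row_count), so no single launch runs near 5 seconds.
__global__ void chunkedKernel( float* d_tss_r, float* d_TimeData, unsigned int pts,
                               int len_DelT, unsigned int scans,
                               int row_start, int row_count)
{
    float *d_rpc;
    for (int r_index = row_start; r_index < row_start + row_count; r_index++)
        for (int k = 0; k < len_DelT; k++)
        {
            d_rpc = d_TimeData + r_index;
            for (int c_index = 0; c_index < scans; c_index++)
                d_rpc += pts;
        }
    d_tss_r[threadIdx.x] = 1.0f;
}

// Host side: issue the launches back to back; the watchdog timer restarts
// for every launch.
void launchInChunks( float* d_tss_r, float* d_TimeData, unsigned int pts,
                     int len_DelT, unsigned int scans)
{
    const int total_rows = 16400;  // outer-loop trip count from the code above
    const int chunk = 1024;        // illustrative; tune so one launch stays well under 5 s

    dim3 grid( 1, 1, 1);
    dim3 threads( 64, 1, 1);

    for (int row = 0; row < total_rows; row += chunk)
    {
        int count = (total_rows - row < chunk) ? (total_rows - row) : chunk;
        chunkedKernel<<< grid, threads, 0 >>>(d_tss_r, d_TimeData, pts,
                                              len_DelT, scans, row, count);
        CUDA_SAFE_CALL( cudaThreadSynchronize());
    }
}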

Thanks Jeroen, that fixed it.

Scott

I got the impression from some statements made by David Kirk that this limitation (a long runtime wedging the card, etc.) was cured, or at least not necessarily applicable to all systems. Is this also a problem on Linux?

John

It continues to be an issue on Windows; the solution is to use separate devices for running your desktop and for running CUDA.

On Linux this is not as much of a problem; however, we still recommend using another GPU for display, because running long CUDA programs on the primary display GPU can lead to loss of interactivity.
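If the machine has more than one GPU, you can also select the compute board explicitly before any other CUDA runtime call. A minimal sketch (the device index is illustrative; enumerate the devices to find the non-display board on your own system):

#include <stdio.h>
#include <cuda_runtime.h>

int main( void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    // List the devices so you can tell which index is the display board.
    for (int i = 0; i < count; i++)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s\n", i, prop.name);
    }

    // Illustrative: assume device 1 is the board not driving the display.
    // This must be called before any kernel launch or memory allocation.
    cudaSetDevice(1);

    // ... run the CUDA portion of the program as usual from here ...
    return 0;
}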

Mark

Hi,

I am using a GeForce 6800 GT as the primary graphics card for the display and an 8800 GTX for computation, with the 97_73 driver installed for both of them.

Although the “unspecified launch failure” error message does not appear any more, the computation results are not always correct. It seems that when the device runtime exceeds 5 seconds the results are all “0”, and when it is within 5 seconds the results are correct.

Can someone help me fix this problem? Thanks
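One way to see whether the launch itself is failing, even when nothing is printed, is to synchronize right after the kernel and query the runtime’s error status explicitly. A minimal sketch, reusing the launch from the program above (assuming its includes):

// ... immediately after the kernel launch ...
testKernel<<< grid, threads, 0 >>>(d_tss_r, d_TimeData, 16400, len_DelT, 1000);

// The launch returns to the host immediately; block until the kernel finishes.
cudaThreadSynchronize();

// If the watchdog killed the kernel, an error should be reported here.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
{
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}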

I have this same problem. No error messages, but the results are not correct. This seems to happen when the device executes for more than 70 ms. Using 2 NVIDIA video cards, one PCI… Windows XP.
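One thing worth double-checking when measuring times that short: kernel launches are asynchronous, so a timer stopped before a synchronize may only measure the launch overhead rather than the kernel itself. A minimal, self-contained sketch of a synchronized measurement using the SDK’s cutil timers (the kernel here is a trivial stand-in, not the code from this thread):

#include <stdio.h>
#include <cutil.h>

// Trivial stand-in kernel, used only to show the timing pattern.
__global__ void work( float* out)
{
    out[threadIdx.x] = 1.0f;
}

int main( int argc, char** argv)
{
    CUT_CHECK_DEVICE();

    float* d_out;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_out, 64 * sizeof(float)));

    unsigned int timer = 0;
    cutCreateTimer(&timer);
    cutStartTimer(timer);

    work<<< 1, 64 >>>(d_out);
    // Without this, the timer stops before the kernel has actually finished.
    CUDA_SAFE_CALL( cudaThreadSynchronize());

    cutStopTimer(timer);
    printf("kernel time: %f ms\n", cutGetTimerValue(timer));

    cutDeleteTimer(timer);
    CUDA_SAFE_CALL( cudaFree(d_out));
    CUT_EXIT(argc, argv);
}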

I am running CUDA on Linux and I appear to have the same issue. The computer becomes totally unresponsive. In fact, since I am on a school network, I can ssh into the machine and run ‘top’ to see which program is causing the problems. It turns out not to be my program but the X server, which is pegged at 100% CPU use. The strange thing is that I have gotten the program to work correctly three times (out of the dozens of runs I have tried).

I just created another thread with a similar problem description. I strongly suspect that my issue is exactly the same as the problem discussed here, so I will paste my previous post here and edit my topic to reflect that fact.

I am also having similar issues on Linux. My code runs fine in emulation for millions of iterations, but on the G80 it seems to fail when I reach into the tens of millions. The runtime is measured at more than 7000 ms or so on failed runs, and under 7000 ms when it succeeds…

Is this currently a problem on Linux?