Unspecified launch failure: kernel fails if a loop is too long

Hello,

I’m having an issue where CUDA fails to launch my kernel depending on the length of one of the loops in a nested loop structure. With a few iterations it works fine; however, if the number of iterations is increased, CUDA issues the error message “unspecified launch failure”.

Here is the calling program:

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cutil.h>

// includes, kernels
#include <template_kernel.cu>

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest( int argc, char** argv);

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main( int argc, char** argv)
{
    runTest( argc, argv);
    CUT_EXIT(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void runTest( int argc, char** argv)
{
    CUT_CHECK_DEVICE();

    unsigned int num_threads = 64;
    unsigned int len_DelT = 801;

    // allocate host memory
    float* TimeData = (float*) malloc(sizeof(float) * 16400000);

    // initialize host memory
    for (int i = 0; i < 16400000; i++)
        TimeData[i] = i;

    // allocate device memory
    float* d_TimeData;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_TimeData, sizeof(float) * 16400000));

    // copy host memory to device
    CUDA_SAFE_CALL( cudaMemcpy( d_TimeData, TimeData, sizeof(float) * 16400000, cudaMemcpyHostToDevice));

    // allocate device memory for result
    float* d_tss_r;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_tss_r, sizeof(float) * 16400 * len_DelT));

    // setup execution parameters
    dim3 grid( 1, 1, 1);
    dim3 threads( num_threads, 1, 1);

    // execute the kernel
    testKernel<<< grid, threads, 0 >>>(d_tss_r, d_TimeData, 16400, len_DelT, 1000);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    // allocate memory for the result on the host side
    float* h_tss_r_host = (float*) malloc(sizeof(float) * 16400 * len_DelT);

    // copy result from device to host
    CUDA_SAFE_CALL( cudaMemcpy( h_tss_r_host, d_tss_r, sizeof(float) * 16400 * len_DelT, cudaMemcpyDeviceToHost));

    for (int k = 0; k < 10; k++)
    {
        printf("h_tss_r_host[%d] = %g\n", k, h_tss_r_host[k]);
    }

    // cleanup memory
    free(TimeData);
    CUDA_SAFE_CALL( cudaFree(d_TimeData));
    CUDA_SAFE_CALL( cudaFree(d_tss_r));
    free(h_tss_r_host);
}

And here is the kernel:

__global__ void testKernel( float* d_tss_r, float* d_TimeData, unsigned int pts, int len_DelT, unsigned int scans)
{
    // Plain local variables; the __device__ qualifier is not valid on
    // variables declared inside a kernel body.
    float *d_rpc;
    int r_index, c_index, k;

    // Busy-work loops, kept only to reproduce the failure;
    // they do not contribute to the output.
    for (r_index = 0; r_index < 16400; r_index++)
    {
        for (k = 0; k < len_DelT; k++)
        {
            d_rpc = d_TimeData + r_index;
            for (c_index = 0; c_index < scans; c_index++)
            {
                d_rpc += pts;
            }
        }
    }

    d_tss_r[threadIdx.x] = 1.0f;
}

This is part of a larger program that I have stripped down as much as possible while still reproducing the error. As a result, the program in its current state does not do much, and the nested loops do not contribute to the output. If I run the program with a small value for len_DelT (around 100 or less), everything is fine. However, if len_DelT is larger (such as the 801 in the code above), CUDA fails with the aforementioned message.

Any help is appreciated.

Scott

Hey,

I have the same problem and submitted it to NVIDIA. The problem is caused by the execution time exceeding 5 seconds. See the Windows release notes:

Individual GPU program launches are limited to a run time of less than 5 seconds on the device. Exceeding this time limit usually causes a launch failure reported through the CUDA driver or the CUDA runtime, but in some cases hangs the entire machine, requiring a hard reset. Microsoft Windows has a “watchdog” timer that causes programs using the primary graphics adapter to time out if they run longer than the maximum allowed time. For this reason it is recommended that CUDA is run on a G80 that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.
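That is consistent with your numbers: with len_DelT = 801 the kernel’s loops run roughly 16400 × 801 × 1000 ≈ 1.3 × 10^10 iterations per thread, far more than with len_DelT ≈ 100. If you cannot dedicate a second GPU, one common workaround is to split the work into several shorter launches, since the watchdog limit applies to each launch individually, not to their sum. A minimal sketch along those lines (chunkedKernel, launchInChunks, and the chunk size are illustrative, not part of your code):

// Hypothetical split of testKernel: each launch handles only the rows in
// [row_start, row_start + row_count), so no single launch runs near 5 seconds.
__global__ void chunkedKernel( float* d_tss_r, float* d_TimeData, unsigned int pts,
                               int len_DelT, unsigned int scans,
                               int row_start, int row_count)
{
    float *d_rpc;
    for (int r_index = row_start; r_index < row_start + row_count; r_index++)
        for (int k = 0; k < len_DelT; k++)
        {
            d_rpc = d_TimeData + r_index;
            for (int c_index = 0; c_index < scans; c_index++)
                d_rpc += pts;
        }
    d_tss_r[threadIdx.x] = 1.0f;
}

// Host side: issue the launches back to back; the watchdog timer restarts
// for every launch.
void launchInChunks( float* d_tss_r, float* d_TimeData, unsigned int pts,
                     int len_DelT, unsigned int scans)
{
    const int total_rows = 16400;  // outer-loop trip count from the code above
    const int chunk = 1024;        // illustrative; tune so one launch stays well under 5 s

    dim3 grid( 1, 1, 1);
    dim3 threads( 64, 1, 1);

    for (int row = 0; row < total_rows; row += chunk)
    {
        int count = (total_rows - row < chunk) ? (total_rows - row) : chunk;
        chunkedKernel<<< grid, threads, 0 >>>(d_tss_r, d_TimeData, pts,
                                              len_DelT, scans, row, count);
        CUDA_SAFE_CALL( cudaThreadSynchronize());
    }
}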

Thanks Jeroen, that fixed it.

Scott

I got the impression from some statements made by David Kirk that this limitation (a long runtime wedging the card, etc.) was cured, or at least not necessarily applicable to all systems. Is this also a problem on Linux?

John

It continues to be an issue on Windows; the solution is to use separate devices for running your desktop and for running CUDA.

On Linux this is not as much of a problem; however, we still recommend using another GPU for display, because running long CUDA programs on the primary display GPU can lead to loss of interactivity.
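If the machine has more than one GPU, you can also select the compute board explicitly before any other CUDA runtime call. A minimal sketch (the device index is illustrative; enumerate the devices to find the non-display board on your own system):

#include <stdio.h>
#include <cuda_runtime.h>

int main( void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    // List the devices so you can tell which index is the display board.
    for (int i = 0; i < count; i++)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s\n", i, prop.name);
    }

    // Illustrative: assume device 1 is the board not driving the display.
    // This must be called before any kernel launch or memory allocation.
    cudaSetDevice(1);

    // ... run the CUDA portion of the program as usual from here ...
    return 0;
}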

Mark

Hi,

I am using a GeForce 6800 GT as the primary graphics card for the display and an 8800 GTX for computation, with the 97_73 driver installed for both of them.

Although the “unspecified launch failure” error message does not appear any more, the computation results are not always correct. It seems that when the device runtime exceeds 5 seconds the results are all “0”, and when it is within 5 seconds the results are correct.

Can someone help me fix this problem? Thanks
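One way to see whether the launch itself is failing, even when nothing is printed, is to synchronize right after the kernel and query the runtime’s error status explicitly. A minimal sketch, reusing the launch from the program above (assuming its includes):

// ... immediately after the kernel launch ...
testKernel<<< grid, threads, 0 >>>(d_tss_r, d_TimeData, 16400, len_DelT, 1000);

// The launch returns to the host immediately; block until the kernel finishes.
cudaThreadSynchronize();

// If the watchdog killed the kernel, an error should be reported here.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
{
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}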

I have this same problem. No error messages, but the results are not correct. This seems to happen when the device executes for more than 70 ms. Using 2 NVIDIA video cards, one PCI… Windows XP.
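One thing worth double-checking when measuring times that short: kernel launches are asynchronous, so a timer stopped before a synchronize may only measure the launch overhead rather than the kernel itself. A minimal, self-contained sketch of a synchronized measurement using the SDK’s cutil timers (the kernel here is a trivial stand-in, not the code from this thread):

#include <stdio.h>
#include <cutil.h>

// Trivial stand-in kernel, used only to show the timing pattern.
__global__ void work( float* out)
{
    out[threadIdx.x] = 1.0f;
}

int main( int argc, char** argv)
{
    CUT_CHECK_DEVICE();

    float* d_out;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_out, 64 * sizeof(float)));

    unsigned int timer = 0;
    cutCreateTimer(&timer);
    cutStartTimer(timer);

    work<<< 1, 64 >>>(d_out);
    // Without this, the timer stops before the kernel has actually finished.
    CUDA_SAFE_CALL( cudaThreadSynchronize());

    cutStopTimer(timer);
    printf("kernel time: %f ms\n", cutGetTimerValue(timer));

    cutDeleteTimer(timer);
    CUDA_SAFE_CALL( cudaFree(d_out));
    CUT_EXIT(argc, argv);
}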

I am running CUDA on Linux and I appear to have the same issue. The computer becomes totally unresponsive. In fact, since I am on a school network, I can ssh into the machine and run ‘top’ to see which program is causing the problems. It turns out not to be my program but the X server, which is pegged at 100% CPU use. The strange thing is that I have gotten the program to work correctly three times (out of the dozens of runs I have tried).

I just created another thread with a similar problem description. I strongly suspect that my issue is exactly the same as the problem discussed here, so I will paste my previous post here and edit my topic to reflect that fact.

I am also having similar issues on Linux. My code runs fine in emulation for millions of iterations, but on the G80 it seems to fail when I reach into the tens of millions. The runtime is measured at more than 7000 ms or so on failed runs, and under 7000 ms when it succeeds…

Is this currently a problem on Linux?