Kernel problem, execution stops after ~15 min


I’ve got a CUDA C/C++ program that launches a kernel with 512 threads per block and 72498037 blocks. But there is a problem… When I launch my code, it runs for approximately 15 minutes and… then black screen + driver problem, and the program stops.

I set the watchdog timer to an hour (3600 s), so I think the problem does not come from there.

I ran cuda-memcheck and I didn’t notice any memory errors.

I check errors from CUDA API calls with a macro, defined like this:

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
	if (code != cudaSuccess) {
		printf("GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
		if (abort)
			exit(code);
	}
}
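For example, at call sites it is used like this (a sketch; the buffer name and size are just placeholders, not from my code):

```cuda
double *d_buf = nullptr;

// Wrap every runtime API call, so any failure is reported with file and line:
gpuErrchk(cudaMalloc((void **)&d_buf, 628319 * sizeof(double)));
gpuErrchk(cudaMemset(d_buf, 0, 628319 * sizeof(double)));
gpuErrchk(cudaFree(d_buf));
```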

In the main function my code looks like this:

int main(void)
  cudaMalloc -> pass
  cudaMalloc -> pass
  cudaMalloc -> pass
  cudaMalloc -> pass
  cudaMemcpy -> pass
  cudaMemcpy -> pass
  kernel <<< 614, 1024 >>> -> pass
  cudaStreamQuery -> pass
  cudaGetLastError -> pass
  cudaMemcpy -> pass
  otherkernel <<< 72498037 ,512 >>> -> pass
  cudaStreamQuery -> pass
//cudaDeviceSynchronize -> GPUassert: unknown error "path to the"
  cudaGetLastError -> pass
  cudaFree -> GPUassert: unknown error "path to the"
  cudaFree -> exited
  cudaFree -> exited
  cudaFree -> exited
  return (0);

If I add the commented line to the main function, I get the error on that line; if not, on the first cudaFree.
After the error the code exits.

So… what’s happening?

And another thing: I cannot launch 1024 threads per block on the second kernel… I don’t know why, so I’ll try to reinstall the CUDA toolkit and driver and see if it works.

I suspect your error checking is not comprehensive. For a quick check, run your application under the control of cuda-memcheck, and fix all issues it reports. The fact that you cannot launch 1024 threads in a kernel could be indicative of a bug in your code, or that you are getting an out-of-resources error on a kernel launch that is currently being ignored.

In addition to checking CUDA API calls with a macro like gpuErrchk(), you need to check status after every kernel launch, which can give rise to both synchronously and asynchronously reported errors. For example:

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
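Used right after a launch, for instance (a sketch; the kernel name and launch configuration are the ones from your post):

```cuda
kernelPosGeo<<<614, 1024>>>(geo, CUDA_geo);
CHECK_LAUNCH_ERROR();  // reports both a rejected launch (synchronous)
                       // and a crash inside the kernel (asynchronous)
```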

[s]OK, so until now I didn’t get any errors, and now… I get 129 errors in my first kernel… And I don’t know what’s wrong with this function…

So here is the code :

The kernel:

__global__ void kernelPosGeo(Cartesian posStartGeo, Cartesian *CUDA_geo)
{
	size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
	Global global;

	if (tid < 628319) // number of wanted positions (on a circle, spaced 1E-5 rad apart)
		CUDA_geo[tid] = global.rotationZAxis(posStartGeo, (double)tid * 1E-5); // position after a tid * 1E-5 rad rotation around the Z axis
}

The main function (the allocations, the kernel call, and the copy):

int main(void)
{
	Cartesian geo(42164200, 0, 0), *CUDA_geo;
	calcAngle angle;

	cudaMalloc((void **)&CUDA_geo, sizeof(Cartesian) * 628319);
	cudaMalloc((void **)&angle.tabGeo, sizeof(Cartesian) * 628319);
	kernelPosGeo <<< 614, 1024 >>> (geo, CUDA_geo);
	cudaMemcpy(angle.tabGeo, CUDA_geo, sizeof(Cartesian) * 628319, cudaMemcpyDeviceToDevice);
	...
}

And the parts of the previous classes that matter:

class Cartesian
{
	double x;
	double y;
	double z;
	// some functions
};

class calcAngle
{
	Cartesian *tabGeo;
	// some functions
	// some variables
};

So, looking at what I did: I pass a variable to the kernel (and I can print its value) and an array allocated on the device.

I run the kernel and assign the new position to the corresponding index of this array. I allocated 628319 elements and I set 628319 values.

In main I take those values back and copy them to another array, calcAngle.tabGeo, which was previously allocated on the device too, with the same size (628319 * sizeof(Cartesian)).

I have used this function for a while and did not have any memory errors until now.

What cuda-memcheck says is:

“Invalid global write of size 8
at 0x00000190 in kernelPosGeo(Cartesian, Cartesian*)
by thread (641, 0, 0) in block (19, 0, 0)
Address 0x100c46ae is misaligned”

And I get this error 129 times… So what did I do wrong here? (The thread index changes, but not the block index.)

From my point of view I did everything by the book…

Edit: Where must I use your #define? After every CUDA API call? If so, I get an “unspecified launch failure” error on the CUDA API call to cudaThreadSynchronize.[/s]

Edit 2: OK so… it works, I had just misunderstood your define. After 15 minutes it failed again: black screen and… the program stopped.

With your define I get:

CUDA error in file “path to the file” in line 229 : unknown error.

which is the line where CHECK_LAUNCH_ERROR is called.

From cuda-memcheck I get:

Program hit cudaErrorUnknown (error 30) due to “unknown error” on CUDA API call to cudaThreadSynchronize.

Saved host backtrace “bla bla bla”

Error summary: 1 error

So that’s the only error I get from my code: the kernel crash. The way it crashes looks like hitting the Windows watchdog timer, but like I said before, I changed it to another value. Actually it’s not 3600 s anymore, it’s 36000 s:

This “unknown error” reported on multiple CUDA API calls suggests that there was an error in previous CUDA operations, and that this error was not caught. The error status then is sticky until cleared. If you have not done so yet, I would recommend adding proper error checking to all API calls and kernel launches.

I am not familiar with manipulations of the operating system’s GUI watchdog timer, and the details may differ by operating system version. It is usually possible to limit kernel run times to less than the watchdog time limit (typically around 2 seconds) without a negative impact on application performance, so I would suggest looking into that.

I haven’t hit a watchdog limit in years. I seem to recall that it leads to a specific “timeout” error message rather than an “unknown error” but I could be wrong in that. I am afraid that with the little information provided this issue can’t be diagnosed remotely (at least I can’t).
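One common pattern for keeping each launch short (a sketch under assumed names; `otherkernel` takes an extra offset/count pair here, which your version would need to be adapted to):

```cuda
// Split the huge grid into chunks so each launch stays well below the
// watchdog limit; the error check after each chunk also localizes failures.
const size_t total = 72498037ULL * 512ULL;   // all work items
const size_t chunk = 1ULL << 24;             // work items per launch (tune this)

for (size_t offset = 0; offset < total; offset += chunk) {
    size_t n = (total - offset < chunk) ? (total - offset) : chunk;
    otherkernel<<<(unsigned)((n + 511) / 512), 512>>>(/* args..., */ offset, n);
    CHECK_LAUNCH_ERROR();   // synchronizes, so each chunk finishes before the next
}
```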

Hi, so for the previous CUDA API calls I check them as follows (not exactly this way, but this is what is done):

cudaError_t error;

error = /* CUDA API call */;
if (error != cudaSuccess)
	printf("%s\n", cudaGetErrorString(error));

So on that side I guess I’m doing it right? If not, I’ll see how to improve it.

The reason you never change the watchdog timer is that you don’t have to. But if you run something HUGE, simply launching all that work takes a certain amount of time, and if it takes too long you hit the watchdog timer; that is what is happening here. I also added more complex calculations and some other things that take a little time to execute, and since I launch 37 118 994 960 threads, it takes a HUGE time to execute. So… I modified this timeout value.

And I guess I solved my problem. Since the way the kernel crashed looked like a “driver not responding” problem that stops the kernel, I went on the web to look up how to turn off that watchdog; I set the TdrLevel key to 0, and my program has now run for more than an hour, so it seems the problem is solved.

And another thing:
If you never hit the watchdog timer, is it because you work on accelerating programs like a raytracer or, I don’t know… AI?

I use CUDA because I’ve got an application that does a lot of calculation, and with CPU multithreading it takes… too long.
So I use CUDA to accelerate it.

So basically, my program takes as a parameter the position of a satellite in geostationary orbit and “returns” in an array 628319 positions (each position spaced 1E-5 rad apart). I then use all those positions in another kernel, which takes as parameters those positions, the Cartesian coordinates of an earth station, and a bunch of satellites with their own orbital parameters. For a given time, I calculate the position of the earth station and the positions of all the satellites, and for each satellite I determine the smallest angle formed by the earth station, the geostationary satellite, and that satellite.

In a previous test I had 2 satellites and approximately 67 000 000 timesteps.
The program takes 3.8 s to execute, and 6.8 s if I store the results. So even there I hit the watchdog timer set to 2 s; that’s why I needed to change it to another value.

By the way, thanks for every response, I learned something here!

You could compute half a gazillion time steps across multiple kernel launches; I don’t see why they would have to run within a single kernel launch. Example: I run Folding@Home on my home computer, which is GPU accelerated. While an entire simulation run can take up to 36 hours before it returns results to the F@H server, each individual kernel launch is only about 0.1 to 0.2 seconds.

Yeah, but somewhere in your program you’ve got a loop? And this loop is less efficient than one big kernel, right? I tested launching multiple kernels and did not get really good results…

It’s not my code, but yes, presumably Folding@Home is running a big simulation loop with hundreds of thousands of kernel launches. The loss in efficiency is negligible: 5-10 usec of kernel launch time versus 100-200 msec of kernel run time, that is an impact of <= 1/100 of one percent, or <= 13 seconds of overhead in 36 hours of wall-clock run time. The data stays resident on the GPU the entire time.