cudaDeviceSynchronize doesn't work if the kernel function takes too long to complete

Hi,

I have developed an application which has this problem (see the topic title).
At first I thought it was a problem with my application, but then I tried creating a standard CUDA
project with Parallel Nsight 2.1 (the one which adds 5 integers in parallel) and added a loop
that repeats c[i] = a[i] + b[i]; 100000000 times, and it shows the same problem.
If I repeat the loop only 10000 times (for example) it works perfectly.

Here is what happens:
Everything works fine, even the kernel launch, but as soon as cudaDeviceSynchronize is called,
the screen goes black for a few seconds, then the video comes back and Windows notifies me:
“Display driver stopped responding and has recovered!”.

The cudaDeviceSynchronize function returns with the “cudaErrorLaunchTimeout” error.

I tried this Parallel Nsight app both on my laptop and on my desktop computer, and both show this behavior.

Here is the code (it’s the same as the standard Parallel Nsight project, plus one line of code):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);

__global__ void addKernel(int *c, const int *a, const int *b)
{
int i = threadIdx.x;

for(int k = 0; k < 100000000; k++) /* the only new line of code */
    c[i] = a[i] + b[i];

}

int main()
{
const int arraySize = 5;
const int a[arraySize] = { 1, 2, 3, 4, 5 };
const int b[arraySize] = { 10, 20, 30, 40, 50 };
int c[arraySize] = { 0 };

// Add vectors in parallel.
cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "addWithCuda failed!");
    return 1;
}

printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
    c[0], c[1], c[2], c[3], c[4]);

// cudaDeviceReset must be called before exiting in order for profiling and
// tracing tools such as Parallel Nsight and Visual Profiler to show complete traces.
cudaStatus = cudaDeviceReset();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaDeviceReset failed!");
    return 1;
}

return 0;

}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)
{
int *dev_a = 0;
int *dev_b = 0;
int *dev_c = 0;
cudaError_t cudaStatus;

// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
    goto Error;
}

// Allocate GPU buffers for three vectors (two input, one output).
cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

// Copy input vectors from host memory to GPU buffers.
cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

// Launch a kernel on the GPU with one thread for each element.
addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
    goto Error;
}

// Copy output vector from GPU buffer to host memory.
cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

Error:
cudaFree(dev_c);
cudaFree(dev_a);
cudaFree(dev_b);

return cudaStatus;

}

You are hitting the watchdog timer timeout which is there to prevent your computer from becoming unusable because the user interface cannot update the screen anymore. Either limit your kernel to a runtime of less than about two seconds, or use a dedicated card for CUDA which does not run a display driver (under Windows this requires a Tesla card).
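One common way to follow the "less than about two seconds" advice is to split the long-running loop across many short kernel launches, synchronizing after each one so that no single launch exceeds the watchdog limit. A minimal sketch against the poster's kernel (the chunk size and the `addChunked` helper are illustrative, not part of the original code):

```cuda
// Sketch: split a very long iteration count into short kernel launches
// that each finish well under the watchdog limit.
__global__ void addKernelChunk(int *c, const int *a, const int *b, int iters)
{
    int i = threadIdx.x;
    for (int k = 0; k < iters; k++)      // short inner loop per launch
        c[i] = a[i] + b[i];
}

cudaError_t addChunked(int *dev_c, const int *dev_a, const int *dev_b,
                       unsigned size, long long totalIters)
{
    const long long CHUNK = 1000000;     // tune so one launch stays < ~2 s
    cudaError_t status = cudaSuccess;
    for (long long done = 0; done < totalIters; done += CHUNK) {
        long long remaining = totalIters - done;
        int iters = (int)(remaining < CHUNK ? remaining : CHUNK);
        addKernelChunk<<<1, size>>>(dev_c, dev_a, dev_b, iters);
        status = cudaDeviceSynchronize();  // wait; the watchdog resets between launches
        if (status != cudaSuccess)
            break;
    }
    return status;
}
```

The display may still stutter between launches, but each individual launch stays under the timeout, so the driver is never reset.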

The fact that you only see the timeout at the cudaDeviceSynchronize() is an artifact of kernel batching: launches are queued and submitted to the GPU in batches (to mitigate the cost of slow kernel launches under Windows), so the error only surfaces when cudaDeviceSynchronize() flushes the queue and waits for the kernel to finish.
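Because the launch itself is asynchronous, a check immediately after the `<<<...>>>` call only catches launch-configuration problems; execution errors such as the watchdog timeout surface at the next synchronizing call. A sketch of where each kind of error appears (error handling style is illustrative):

```cuda
addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

// Catches launch-time problems (e.g. an invalid grid/block configuration).
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));

// Execution errors, including cudaErrorLaunchTimeout from the watchdog,
// are only reported once the host actually waits for the kernel.
cudaError_t execErr = cudaDeviceSynchronize();
if (execErr != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(execErr));
```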

So under Windows, if I buy the GeForce GTX 590, I can’t use it as if it were two video cards, one for the display driver and the other for my computation?

http://www.geforce.com/Hardware/GPUs/geforce-gtx-590/specifications

Isn’t there any way to remove the watchdog timer? I don’t care if the screen stays idle. I just need to run a long, farm-style computation and leave it working until it finishes.

And if I can’t on Windows, can I do it on Linux?

Also, I don’t mind giving up a few multiprocessors to the display driver, if needed.

Use Linux and run without X. Look at the release notes or search the forum on how to set up CUDA in Linux without X.
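For reference, the Windows watchdog is the TDR (Timeout Detection and Recovery) mechanism, and Microsoft documents registry values under `HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers` that lengthen or disable it. Changing these affects system stability (a hung kernel will freeze the display) and requires a reboot; the values below are illustrative, not a recommendation:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; Seconds before the watchdog fires (default is 2); 0x3c = 60 seconds.
"TdrDelay"=dword:0000003c
; 0 = timeout detection disabled; 3 = default (recover on timeout).
"TdrLevel"=dword:00000000
```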