cudaDeviceSynchronize doesn't work if the kernel function takes too long to complete

Hi,

I have developed an application which has this problem (see the topic title).
At first I thought it was a problem with my application, but then I tried creating a standard CUDA
project with Parallel Nsight 2.1 (the one which adds 5 integers in parallel) and added a loop
that repeats c[i] = a[i] + b[i]; 100000000 times, and it shows the same problem.
If I repeat the loop only 10000 times (for example) it works perfectly.

Here is what happens:
Everything works fine, even the kernel launch, but as soon as cudaDeviceSynchronize is called,
the screen goes black for a few seconds, then the video comes back and Windows notifies me:
“Display driver stopped responding and has recovered!”.

The cudaDeviceSynchronize function returns with the “cudaErrorLaunchTimeout” error.

I tried this Parallel Nsight app both on my laptop and on my desktop computer, and both show this behavior.

Here is the code (it’s the same as the standard Parallel Nsight project, plus one line of code):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);

__global__ void addKernel(int *c, const int *a, const int *b)
{
int i = threadIdx.x;

for(int k = 0; k < 100000000; k++) /* the only new line of code */
    c[i] = a[i] + b[i];

}

int main()
{
const int arraySize = 5;
const int a[arraySize] = { 1, 2, 3, 4, 5 };
const int b[arraySize] = { 10, 20, 30, 40, 50 };
int c[arraySize] = { 0 };

// Add vectors in parallel.
cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "addWithCuda failed!");
    return 1;
}

printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
    c[0], c[1], c[2], c[3], c[4]);

// cudaDeviceReset must be called before exiting in order for profiling and
// tracing tools such as Parallel Nsight and Visual Profiler to show complete traces.
cudaStatus = cudaDeviceReset();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaDeviceReset failed!");
    return 1;
}

return 0;

}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)
{
int *dev_a = 0;
int *dev_b = 0;
int *dev_c = 0;
cudaError_t cudaStatus;

// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
    goto Error;
}

// Allocate GPU buffers for three vectors (two input, one output).
cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

// Copy input vectors from host memory to GPU buffers.
cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

// Launch a kernel on the GPU with one thread for each element.
addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
    goto Error;
}

// Copy output vector from GPU buffer to host memory.
cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

Error:
cudaFree(dev_c);
cudaFree(dev_a);
cudaFree(dev_b);

return cudaStatus;

}

You are hitting the watchdog timer timeout which is there to prevent your computer from becoming unusable because the user interface cannot update the screen anymore. Either limit your kernel to a runtime of less than about two seconds, or use a dedicated card for CUDA which does not run a display driver (under Windows this requires a Tesla card).
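One common way to follow the "less than about two seconds" advice is to split the long-running loop across many short kernel launches, synchronizing after each one so that no single launch exceeds the watchdog limit. A minimal sketch against the poster's kernel (the chunk size and the `addChunked` helper are illustrative, not part of the original code):

```cuda
// Sketch: split a very long iteration count into short kernel launches
// that each finish well under the watchdog limit.
__global__ void addKernelChunk(int *c, const int *a, const int *b, int iters)
{
    int i = threadIdx.x;
    for (int k = 0; k < iters; k++)      // short inner loop per launch
        c[i] = a[i] + b[i];
}

cudaError_t addChunked(int *dev_c, const int *dev_a, const int *dev_b,
                       unsigned size, long long totalIters)
{
    const long long CHUNK = 1000000;     // tune so one launch stays < ~2 s
    cudaError_t status = cudaSuccess;
    for (long long done = 0; done < totalIters; done += CHUNK) {
        long long remaining = totalIters - done;
        int iters = (int)(remaining < CHUNK ? remaining : CHUNK);
        addKernelChunk<<<1, size>>>(dev_c, dev_a, dev_b, iters);
        status = cudaDeviceSynchronize();  // wait; the watchdog resets between launches
        if (status != cudaSuccess)
            break;
    }
    return status;
}
```

The display may still stutter between launches, but each individual launch stays under the timeout, so the driver is never reset.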

The fact that you only see the timeout at the cudaDeviceSynchronize() is an artifact of kernel batching: launches are queued and submitted to the GPU in batches (to mitigate the cost of slow kernel launches under Windows), so the error only surfaces when cudaDeviceSynchronize() flushes the queue and waits for the kernel to finish.
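Because the launch itself is asynchronous, a check immediately after the `<<<...>>>` call only catches launch-configuration problems; execution errors such as the watchdog timeout surface at the next synchronizing call. A sketch of where each kind of error appears (error handling style is illustrative):

```cuda
addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

// Catches launch-time problems (e.g. an invalid grid/block configuration).
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));

// Execution errors, including cudaErrorLaunchTimeout from the watchdog,
// are only reported once the host actually waits for the kernel.
cudaError_t execErr = cudaDeviceSynchronize();
if (execErr != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(execErr));
```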

So under Windows, if I buy the GeForce GTX 590, I can’t use it as if it were two video cards, one for the display driver and the other for my computation?

http://www.geforce.com/Hardware/GPUs/geforce-gtx-590/specifications

Isn’t there any way to remove the watchdog timer? I don’t care if the screen stays idle. I just need to run a long, farm-style computation and leave it working until it finishes.

And if I can’t on Windows, can I do it on Linux?

Also, I don’t mind giving up a few multiprocessors to the display driver, if needed.

Use Linux and run without X. Look at the release notes or search the forum on how to set up CUDA in Linux without X.
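For reference, the Windows watchdog is the TDR (Timeout Detection and Recovery) mechanism, and Microsoft documents registry values under `HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers` that lengthen or disable it. Changing these affects system stability (a hung kernel will freeze the display) and requires a reboot; the values below are illustrative, not a recommendation:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; Seconds before the watchdog fires (default is 2); 0x3c = 60 seconds.
"TdrDelay"=dword:0000003c
; 0 = timeout detection disabled; 3 = default (recover on timeout).
"TdrLevel"=dword:00000000
```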