Driver crashes while OpenCL app is running


I have a strange problem: my video driver crashes while my OpenCL application is running. I tried several driver versions (197.45, 257.21, 258.96), but none helped. The CUDA Toolkit version is 3.1. My application is a fluid simulation, and I use OpenGL for the visualization, but I don’t use the OpenGL interop extension.

The system is a workstation with three GPUs, one GTX 285 and two Tesla C1060, running Windows 7 x64. The GTX 285 is used for the visualization, and all the OpenCL computing is done on one of the Teslas.

The memory usage of OpenCL is around 17 MB, so that can’t be the problem.
Once the driver has crashed, all OpenCL calls fail with a “CL_OUT_OF_RESOURCES” error.

Now my question is, what could be the reason for the driver crash?

Best regards
Jan Burgmeier

If one of your kernels runs for more than 2 seconds straight (without returning), Windows’ watchdog timer kicks in and kills it. This is an OS mechanism to keep the desktop responsive if a GPU program runs into an infinite loop or something.

This usually causes a driver crash (the screen blinks and you get a “Driver has stopped responding and was restarted” popup from the tray).

You can disable the timer with a registry hack, split the offending kernel into a few shorter ones, or run on a GPU that’s not attached to a desktop.
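As a rough illustration of the second option (splitting one long launch into several shorter ones), the host can cut the global work size into chunks and enqueue the kernel once per chunk. This is only a sketch: `launch_chunk`, `chunk_count` and `chunk_at` are made-up helper names, and how each chunk is handed to the kernel depends on the application.

```c
#include <assert.h>
#include <stddef.h>

/* One sub-launch: covers work-items [offset, offset + size). */
typedef struct {
    size_t offset;
    size_t size;
} launch_chunk;

/* Number of sub-launches needed to cover `total` work-items
 * with at most `max_chunk` work-items per launch. */
size_t chunk_count(size_t total, size_t max_chunk)
{
    assert(max_chunk > 0);
    return (total + max_chunk - 1) / max_chunk;
}

/* Range of the i-th sub-launch (0-based); the last one may be shorter. */
launch_chunk chunk_at(size_t total, size_t max_chunk, size_t i)
{
    launch_chunk c;

    assert(max_chunk > 0);
    assert(i < chunk_count(total, max_chunk));

    c.offset = i * max_chunk;
    c.size = (total - c.offset < max_chunk) ? total - c.offset : max_chunk;
    return c;
}
```

Each chunk would then get its own clEnqueueNDRangeKernel call, with `c.size` as the global work size. On OpenCL 1.0 the global_work_offset parameter has to be NULL, so `c.offset` would need to be passed as an ordinary kernel argument. From OpenCL 1.1 on it can be passed as global_work_offset instead. If a local work size is given explicitly, the chunk size should be a multiple of it. Which chunk size stays under the watchdog limit has to be measured on the actual hardware.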

EDIT: I just re-read that you run it on one of the Teslas. It’s probably something else then… It might be an illegal memory access or something like that. Segfaults can also trigger a driver restart.

WDDM TDR is disabled, and the kernels run on a Tesla card, so no display is attached.
Could a division by zero or a #QNAN cause the driver to crash? When I read back my OpenCL buffers, there is sometimes a #QNAN in them. I am debugging that at the moment to find its source.
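One way to narrow down where a #QNAN first appears is to read the buffer back after each kernel and scan it on the host. The following is a minimal sketch that assumes the buffer holds `float` values and has already been copied to host memory (for example with clEnqueueReadBuffer). The function name `first_non_finite` is made up for this example.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Returns the index of the first NaN or infinite value in buf[0..n),
 * or n if every element is finite. */
size_t first_non_finite(const float *buf, size_t n)
{
    size_t i;

    assert(buf != NULL || n == 0);

    for (i = 0; i < n; ++i) {
        if (!isfinite(buf[i]))
            return i;
    }
    return n;
}
```

Calling this after every kernel in a time step shows which kernel first produces a bad value, and the returned index points at the affected element. Infinities are reported too, since they often turn into NaN a few operations later.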

The driver also crashes when there is no #QNAN or division by zero.

I forgot to reboot after switching TDR off. Now the driver no longer restarts. Thanks.

Best regards
Jan Burgmeier

I’ve had major issues with the driver restarting and the program crashing, even for simple kernels. Even though they compiled fine, the code would do something funky that crashed the driver. What I suggest is figuring out which kernel is responsible and then, within that kernel, returning early at specific points to see which line(s) of code crash the driver.
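The early-return approach could look roughly like the sketch below. The kernel `step`, its arguments and the `DEBUG_STAGE` macro are all hypothetical and only stand in for the real code. The idea is to rebuild the program with a different stage number each time and note the first stage at which the crash appears.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical kernel source, for illustration only. Building with
 * "-D DEBUG_STAGE=1" makes every work-item return right after stage 1,
 * "-D DEBUG_STAGE=2" after stage 2, and so on. Without the define the
 * whole kernel runs. */
const char kernel_src[] =
    "#ifndef DEBUG_STAGE\n"
    "#define DEBUG_STAGE 99\n"
    "#endif\n"
    "__kernel void step(__global float *v, __global const float *f, int n)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    if (i >= n) return;\n"
    "    float x = v[i];              /* stage 1: load */\n"
    "    if (DEBUG_STAGE <= 1) return;\n"
    "    x += f[i];                   /* stage 2: update */\n"
    "    if (DEBUG_STAGE <= 2) return;\n"
    "    v[i] = x;                    /* stage 3: store */\n"
    "}\n";

/* Writes "-D DEBUG_STAGE=<stage>" into out. Returns 0 on success and
 * -1 if the string does not fit. The result is meant to be passed as
 * the `options` argument of clBuildProgram. */
int make_build_options(int stage, char *out, size_t out_size)
{
    int n;

    assert(out != NULL);

    n = snprintf(out, out_size, "-D DEBUG_STAGE=%d", stage);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

One caveat: when a kernel returns before writing anything, the compiler is free to drop the preceding loads and arithmetic as dead code, so a stage that "passes" may not have executed at all. Writing an intermediate value to a scratch buffer before each return is one way to keep the stage alive.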