Blue screen crash

Hi everybody.

I’m just starting to learn CUDA and practising. Until now, when I made a mistake in my code, I usually got quite informative messages, like “Invalid parameter configuration” or “Out of memory”. But suddenly the same code crashed the whole computer. A blue screen appeared saying that it was related to the NVIDIA card, but it disappeared before I could read anything more.

I am using VS 2008, Vista, and CUDA 2.1beta. Do you think it could be related to the driver? Or just some mad pointer out of control? Is it possible to cause such a big crash just by misusing pointers?

I was trying to implement a silly, dummy median filter (it even uses bubble sort inside!!) with a window size of 11, which runs over an image of approx. 2900x4300 pixels. The grid is 4368x91 after defining the block size to be 1x32. I did it this way so that the buffers I need for sorting fit into shared memory. I estimate the shared memory size to be threadsPerBlock x windowSize x windowSize. In this case, that is 32x121=3872 bytes (< 16384 bytes of shared mem).
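The arithmetic in this post can be checked with a minimal sketch like the one below. It is not the poster's code: the helper names (sharedBytesPerBlock, divUp, copyGuarded) are hypothetical, one byte per pixel is an assumption, and the mapping of the grid's x dimension to image rows is only a guess based on the 4368x91 grid and the 1x32 block.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Shared memory one block needs if every thread keeps its own
// windowSize x windowSize sorting buffer.
// bytesPerPixel = 1 is an assumption (8-bit image).
static size_t sharedBytesPerBlock(int threadsPerBlock, int windowSize,
                                  size_t bytesPerPixel) {
    return (size_t)threadsPerBlock * windowSize * windowSize * bytesPerPixel;
}

// Round-up integer division, used to size the grid.
static int divUp(int n, int d) { return (n + d - 1) / d; }

// Hypothetical kernel showing only the index mapping and the bounds guard.
// With a 1x32 block, 91 blocks cover 91 * 32 = 2912 columns, which is more
// than a 2900-pixel-wide image, so the extra threads must not touch memory.
__global__ void copyGuarded(const unsigned char* in, unsigned char* out,
                            int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // blockDim.x == 1 (assumed)
    int col = blockIdx.y * blockDim.y + threadIdx.y;  // blockDim.y == 32 (assumed)
    if (row >= height || col >= width) return;        // skip threads past the edge
    out[row * width + col] = in[row * width + col];
}
```

With these assumptions, sharedBytesPerBlock(32, 11, 1) gives the 3872 bytes quoted above, and divUp(2900, 32) gives the 91 blocks of the grid. If the buffers held 4-byte values instead, the same layout would need 15488 bytes, which is very close to the 16384-byte limit.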

Well. Hope you can give me a hint.

Thanks in advance.

Gustavo.

I ran into the exact same issue a couple of days ago. The BSOD says “Display driver has stopped responding and could not be recovered.”

Yes, the GPU memory does not have any protection mechanism like the system memory (segmentation fault/access violation). This makes it extremely easy to overwrite display data when running on a GPU with a monitor attached. If this happens, then the display driver will crash, and Vista will try to reset it. On rare occasions, this fails, most likely due to a problem in the driver, and Vista decides it was shot to death and draws the beautiful BSOD.

You mentioned you get out of memory errors before launching the kernel. Launching a kernel with a failed allocation is a very very very bad idea. In such a case, your kernel may be perfect, but tries to write to a memory location pointed to by a stray pointer. The results are the same as with out-of-bounds writes.
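As a rough sketch of that advice (check the allocation and only launch when it succeeded), something like the following could be used. The names cudaOk, fillKernel and fillOnDevice are made up for illustration, and n is assumed to be greater than zero. cudaDeviceSynchronize is the current runtime call; the CUDA 2.x toolkits spelled it cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns true if the call succeeded; otherwise prints the CUDA error text.
static bool cudaOk(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical kernel: writes `value` to the first n bytes of `out`.
__global__ void fillKernel(unsigned char* out, int n, unsigned char value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;  // guard against out-of-bounds writes
}

// Returns 0 on success, 1 if any step failed.
// The kernel is never launched when the allocation failed.
int fillOnDevice(int n, unsigned char value) {
    unsigned char* dOut = NULL;
    if (!cudaOk(cudaMalloc((void**)&dOut, n), "cudaMalloc")) return 1;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fillKernel<<<blocks, threads>>>(dOut, n, value);

    // Launch errors (bad configuration) and execution errors show up separately.
    bool ok = cudaOk(cudaGetLastError(), "kernel launch");
    ok = cudaOk(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(dOut);
    return ok ? 0 : 1;
}
```

Calling fillOnDevice(1024, 7) returns 0 when everything worked, and the error message tells you which step failed otherwise.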

Also, CUDA 2.1 is out of beta, so I highly suggest fetching it.

I also had the same kind of problem: a blue screen and the same error message.
But after commenting out some parts of the kernel, I was able to determine the origin of the error. It came from a __syncthreads() call placed inside a for loop. It seems that it caused a deadlock at the synchronization point.

Hope this information is helpful !

If you get the “Display driver has stopped responding” message, then it’s likely because you told the GPU to do something that is taking too long. This might be an infinite loop, or a __syncthreads() that can never be reached by all threads because one or more of them took a different branch. When Windows says the display driver has stopped responding, what it really means in the context of CUDA is that, in Microsoft’s opinion, the display driver did not report completion of a kernel invocation quickly enough. Obviously the driver can’t report completion if the HW did not complete. Obviously the HW can’t complete some kernels due to programming errors.
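To illustrate the unreachable-barrier case, here is a minimal sketch of a block-wide sum where the work is guarded but the barrier is not. The names blockSum, sumOnDevice and BLOCK are hypothetical, and the kernel assumes it is launched with exactly BLOCK threads per block.

```cuda
#include <cuda_runtime.h>

#define BLOCK 32  // threads per block; must be a power of two for the halving loop

// UNSAFE pattern, shown only as a comment:
//   if (i < n) { tile[threadIdx.x] = in[i]; __syncthreads(); }
// Threads with i >= n never reach the barrier, so the others may wait forever.

__global__ void blockSum(const int* in, int* out, int n) {
    __shared__ int tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0;  // guard the load, not the barrier
    __syncthreads();

    // The loop bounds depend only on blockDim.x, so every thread in the block
    // runs the same number of iterations and hits every barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();  // outside the if, reached by all threads
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

// Host helper: sums up to BLOCK ints with a single block.
// Returns false if n is out of range or any CUDA call fails.
bool sumOnDevice(const int* hostIn, int n, int* result) {
    if (n <= 0 || n > BLOCK) return false;
    int* dIn = NULL;
    int* dOut = NULL;
    bool ok = cudaMalloc((void**)&dIn, n * sizeof(int)) == cudaSuccess
           && cudaMalloc((void**)&dOut, sizeof(int)) == cudaSuccess
           && cudaMemcpy(dIn, hostIn, n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        blockSum<<<1, BLOCK>>>(dIn, dOut, n);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(result, dOut, sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);   // cudaFree(NULL) is a no-op
    cudaFree(dOut);
    return ok;
}
```

The same rule applies to a __syncthreads() inside a for loop: it is only safe when every thread of the block executes the loop the same number of times.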

Linux also has a watchdog timer which will kill the kernel at the driver level if it is attached to a display and takes too long. It is possible to disable this in Windows (not sure about other systems) but then it is possible to completely hang the system due to a programming error.
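If you want to know whether the watchdog applies to a particular card, the runtime can tell you. The sketch below uses cudaDeviceGetAttribute with cudaDevAttrKernelExecTimeout, which is available in current toolkits but not in the CUDA 2.x releases discussed in this thread; the function name watchdogEnabled is made up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if kernels on `device` are subject to a run time limit,
// 0 if they are not, and -1 if the query itself failed.
int watchdogEnabled(int device) {
    int flag = 0;
    cudaError_t err =
        cudaDeviceGetAttribute(&flag, cudaDevAttrKernelExecTimeout, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "attribute query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return flag ? 1 : 0;
}
```

A result of 1 for the card driving your monitor means long-running kernels on it can be killed by the timeout, so it is worth keeping each launch short or running on a second card without a display.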

Also, regarding memory protection, tmurray has said, “There is memory protection. All of the addresses are virtual; there’s no mapping from random addresses to other places in memory with sensitive data.”

Well, there is something in CUDA that is capable of corrupting the display’s memory on fencing errors. Sometimes, the only way to recover is to restart the system.