Cuda Error #4 that requires PC Reboot, Help!!!

Hi!

So I have a CUDA application (multiple different kernels) that runs on a Quadro 6000 with CUDA driver 320.00 in TCC mode, on CUDA runtime 5.0, with Windows Server 2008 R2. The application runs flawlessly many times (about 150 runs), and then at random I get a CUDA Error #4. After the first occurrence I cannot restart the application: I keep getting the error until I reboot the PC. I have already tried running cuda-memcheck, and it does not detect any error in my kernels (even when CUDA error 4 is hit). When I build the kernels with -G I do not hit the error (so I’m guessing there is some kind of timing issue going on). Any help will be greatly appreciated!!

Randomly or deterministically? Error #4 is what, unspecified launch failure? I.e. a segfault. When you get the error in debug, see what you’re accessing… my guess is you have out-of-bounds errors or are passing incorrect launch parameters to the kernel. Try commenting out or replacing statements around where the error occurs until you figure out where the problem is coming from.

It is random. The error is unspecified launch failure, as you say. I do not get the error in debug, so I cannot see what the kernel (or which kernel, for that matter) is accessing. I do not think it is the parameters being passed to the kernels, as I launch the same kernels over and over with the same arguments. What really throws me off is that after the error a cudaDeviceReset() does not clear the error, and I keep getting the unspecified launch failure for ANY CUDA call.

Another thing to add: how often the error is hit seems to vary depending on the PC it is running on (I have around 9 different systems, all of them HP DL370 with the same hardware).

It is difficult to diagnose such issues remotely, so here is a laundry list of things to look into that I use when I encounter such problems.

Is the CUDA code clean of out-of-bounds accesses and race conditions according to cuda-memcheck? Does the host code pass valgrind (or Windows equivalent)? Have you had a chance to try CUDA 5.5?

It is curious that only one system out of nine identical ones running the same code would show these issues. Is this system in any way different from the other eight? Could there be an issue with a noisy power supply, or insufficient cooling? Have the power connectors on the GPU been checked, and the GPU re-seated in the PCIe slot?

Does this Quadro GPU support ECC? If so, I would recommend turning it on to guard against bits getting corrupted over time.

cuda-memcheck did not detect any error. I have not used valgrind; I’ll look into a Windows equivalent. I have not tried CUDA 5.5 yet; when I build my code with 5.5 it complains about an issue in the PTX (I have not investigated why).

The problem shows up on 5 of the 9 servers, not only 1. The Quadro GPU does support ECC, and I actually have it on in all of them.

Any idea why the runtime would go crazy and not reset properly with the cudaDeviceReset() call? Or is there another way to reset?

I see such issues (driver restart required after CUDA error) only on very rare occasions. Keep in mind that most of the time I am using internal development drivers that have various bugs. I am not familiar with the driver internals, but I have speculatively chalked up the need to restart the driver to the fact that some GPU errors may corrupt driver state in a way it cannot recover from. My Windows platform is (for historical reasons) WinXP64, so I have no hands-on experience with TCC.

It is possible that the particular driver version you are using has an issue recovering from some CUDA errors in certain situations, which is why I would suggest trying the latest driver package available for the Quadro.

I think it would be a good idea to find out why the CUDA errors occur in the first place. If error #4 is indeed an unspecified launch failure (you may want to print the error string to make sure), those are almost always due to bugs in user code in my experience. An example is an unchecked CUDA API status return. When it fails (e.g. out of memory, resulting in an invalid device pointer), some downstream kernel will throw an ULF.

I am using driver version 320.57, which is only a few weeks old. All my CUDA calls go through a cuda_check, and I print the error string whenever an error is returned. Most likely it is a bug in my code, but I’m trying to isolate what could cause such a failure and work my way back. If a kernel accesses device memory that is valid at kernel launch but is freed by another thread while the kernel is still running, would that cause the driver failure/corruption you are talking about?

Seems like you have the latest driver, my Windows7 machine here has a Quadro 2000 running 320.49.

Freeing memory that is still being operated on would be bad indeed, that could lead to scribbling on all kind of other data. That applies regardless of whether one is talking of CPU or GPU code, as this could easily destroy data needed for memory management, for example.

I do not know how much “damage” the driver is designed to recover from. You may want to consider filing a bug via the registered developer website for more robust error recovery, attaching a self-contained version of your code as a repro case.

Yeah, I mean, I do not know if that condition is actually happening; I was just checking what the “damage” would be. So it sounds like I should probably write a kernel that replicates that condition and see if the “damage” matches what I’m seeing in the more complex application.

Hi, I forgot to say: this is what I see in nvidia-smi after the error:

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe
Wed Sep 04 16:14:23 2013
+------------------------------------------------------+
| NVIDIA-SMI 5.320.00   Driver Version: 320.00         |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  ERR!                 TCC | ERR!            ERR! |                 ERR! |
|ERR!  ERR!  ERR!  ERR! / ERR!  |     10MB /  6143MB   |    ERR!         ERR! |
+-------------------------------+----------------------+----------------------+
|   1  Quadro FX 3800      WDDM | 0000:28:00.0     N/A |                  N/A |
| 30%   78C  N/A     N/A /  N/A |    997MB /   998MB   |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            ERROR: GPU is lost                                          |
|    1            Not Supported                                               |
+-----------------------------------------------------------------------------+

Update: After a lot of searching for ways to reproduce the problem consistently, and trying to find out why only some systems would reproduce it, I found this:

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdDetails/?spf_p.tpst=swdMain&spf_p.prp_swdMain=wsrp-navigationalState%3Didx%3D4%7CswItem%3Dwk_106353_1%7CswEnvOID%3D%7CitemLocale%3D%7CswLang%3D%7Cmode%3D4%7Caction%3DdriverDocument&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken

After checking the systems that could reproduce the problem, I confirmed that the Quadro 6000 cards in those particular systems had Hynix memory and BIOS 70.00.57.00.03, while the ones that never reproduced it had Samsung memory. After updating the BIOS to 70.00.6F.00.04 the problem was gone!!

So actually the problem was never in my code…

Is there a way to query the BIOS version through CUDA or Windows, so that I can add some code to detect this condition programmatically and tell the user to update the Quadro 6000 BIOS?

I am sorry for the inconvenience but glad to hear you were able to track down the root cause and fix it. I tried looking at the link you posted but it does not seem to work for me [restricted access?]. I was not aware of such VBIOS/memory interaction. This falls under the 0.1% of cases where an ULF is not due to issues in user code.

nvidia-smi -q reports the “VBIOS Version”. Would that be sufficient for your needs?
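
If capturing that report as text is an option, pulling the field out could look roughly like the sketch below. The function name `extract_vbios_version` is made up for illustration, and the "key : value" line layout is an assumption about what `nvidia-smi -q` prints, so it is worth checking against real output from the installed driver.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns the value of the first "VBIOS Version" line
// found in `report`, or an empty string if there is none.
// Assumes each field is printed as "Key : Value" on its own line.
std::string extract_vbios_version(const std::string& report) {
    std::istringstream lines(report);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.find("VBIOS Version") == std::string::npos) continue;
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        // Trim whitespace (including a trailing '\r') around the value.
        const std::string::size_type first =
            line.find_first_not_of(" \t\r", colon + 1);
        if (first == std::string::npos) continue;
        const std::string::size_type last = line.find_last_not_of(" \t\r");
        return line.substr(first, last - first + 1);
    }
    return std::string();
}
```

An empty result should be treated as "unknown" rather than as a pass, since a lost GPU may not report the field at all.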

Well, I want my code to do the check at runtime without calling system functions. Is there another way?

About the link: just google “quadro 6000 firmware update” and click on the link “HP Z820 Workstation - NVIDIA Quadro 6000 Video BIOS (ROM) and …”. There you will see the description of BIOS version 70.00.6F.00.04, and the fixes tab shows:

"The following fixes was added to the NVIDIA Q6000 Video BIOS 70.00.6F.00.04:

  • Fixes instability and corrupt rendering issues on Q6000s with Hynix memory."

As I was saying, the problem was only seen on the Q6000s with Hynix memory. Is there a way to know exactly how this instability was fixed (or how it was caused, for that matter), and whether I can expect my users not to hit this problem again? Should I swap those units for ones with Samsung memory? And how can I catch future problems like this?

Have a look at NVML. As far as I know, nvidia-smi is implemented on top of this library:

https://developer.nvidia.com/nvidia-management-library-nvml
“Identification: Various dynamic and static information is reported, including board serial numbers, PCI device ids, VBIOS/Inforom version numbers and product names”

I haven’t used NVML myself, so that’s about as much information as I am able to provide. I have no insights into your other questions.
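
For the detection asked about earlier, once the VBIOS version string is in hand (NVML exposes it through `nvmlDeviceGetVbiosVersion`, which is not called here), the comparison itself might look like the sketch below. The helper names are hypothetical. It assumes the version is a list of dot-separated hexadecimal fields, as in 70.00.57.00.03 and 70.00.6F.00.04, and that comparing field by field orders the releases; neither assumption is confirmed anywhere in this thread.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a version such as "70.00.6F.00.04" into
// numeric fields, reading each field as hexadecimal (assumed from "6F").
std::vector<unsigned long> parse_vbios_fields(const std::string& version) {
    std::vector<unsigned long> fields;
    std::istringstream in(version);
    std::string field;
    while (std::getline(in, field, '.')) {
        fields.push_back(std::strtoul(field.c_str(), nullptr, 16));
    }
    return fields;
}

// True if `installed` sorts before `required` when compared field by field
// (std::vector's operator< is a lexicographic comparison).
bool vbios_older_than(const std::string& installed,
                      const std::string& required) {
    return parse_vbios_fields(installed) < parse_vbios_fields(required);
}
```

A caller could warn the user when `vbios_older_than(installed, "70.00.6F.00.04")` is true. This only looks at the version; it does not tell Hynix boards from Samsung ones, so it would also flag boards that never showed the problem.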

NVML should do it, thanks so much for all the info!! As for the other questions, so far everything has been running flawlessly with the new BIOS (about 30x the average number of iterations that previously reproduced the error), so it should be fine.