Cuda Error #4 that requires PC Reboot, Help!!!

RDAVIDRR · August 28, 2013, 1:55am

Hi!

So I have a CUDA application (multiple different kernels) that runs on a Quadro 6000 in TCC mode, with CUDA driver 320.00 and CUDA runtime 5.0, on Windows Server 2008 R2. The application runs flawlessly many times (about 150 runs), and then at random I get CUDA error #4. After the first occurrence I cannot restart the application: I keep getting the error until I reboot the PC. I have already tried running cuda-memcheck, and it does not detect any error in my kernels (even when error #4 is hit). When I build the kernels with -G I do not hit the error (so I’m guessing there is some kind of timing issue going on). Any help will be greatly appreciated!!

vacaloca · August 28, 2013, 3:40pm

Randomly or deterministically? Error #4 is what, unspecified launch failure? I.e., a segfault. When you get the error in debug, see what you’re accessing… my guess is you have out-of-bounds errors or are passing incorrect launch parameters to the kernel. Try commenting out or replacing statements around where the error occurs until you figure out where the problem is coming from.

RDAVIDRR · August 28, 2013, 4:16pm

It is random. The error is unspecified launch failure, as you say. I do not get the error in debug, so I cannot see what the kernel (or which kernel, for that matter) is accessing. I do not think it is the parameters being passed to the kernels, as I launch the same kernels over and over with the same arguments. What really throws me off is that after the error a cudaDeviceReset() does not clear it, and I keep getting the unspecified launch failure for ANY CUDA call.

RDAVIDRR · August 28, 2013, 4:21pm

Another thing to add: the frequency of the error seems to vary depending on the PC it is running on (I have around 9 different systems, all of them HP DL370 with the same hardware).

njuffa · August 28, 2013, 5:58pm

It is difficult to diagnose such issues remotely, so here is a laundry list of things to look into that I use when I encounter such problems.

Is the CUDA code clean of out-of-bounds accesses and race conditions according to cuda-memcheck? Does the host code pass valgrind (or Windows equivalent)? Have you had a chance to try CUDA 5.5?

It is curious that only one system out of nine identical ones running the same code would show these issues. Is this system in any way different from the other eight? Could there be an issue with a noisy power supply, or insufficient cooling? Have the power connectors on the GPU been checked, and the GPU re-seated in the PCIe slot?

Does this Quadro GPU support ECC? If so, I would recommend turning it on to guard against bits getting corrupted over time.

RDAVIDRR · August 28, 2013, 6:22pm

Cuda-memcheck did not detect any error. I have not used valgrind; I’ll look into a Windows equivalent. I have not tried CUDA 5.5 yet: when I build my code with 5.5 it complains about an issue in the PTX (I have not investigated why).

The problem shows up on 5 of the 9 servers, not just 1. The Quadro GPU does support ECC, and I actually have it turned on in all of them.

RDAVIDRR · August 28, 2013, 6:23pm

Any idea why the runtime would go crazy and not reset properly with the cudaDeviceReset() call??? Or is there another way to reset?

njuffa · August 28, 2013, 7:24pm

I see such issues (driver restart required after CUDA error) only on very rare occasions. Keep in mind that most of the time I am using internal development drivers that have various bugs. I am not familiar with the driver internals, but I have speculatively chalked up the need to restart the driver to the fact that some GPU errors may corrupt driver state in a way it cannot recover from. My Windows platform is (for historical reasons) WinXP64, so I have no hands-on experience with TCC.

It is possible that the particular driver version you are using has an issue recovering from some CUDA errors in certain situations, which is why I would suggest trying the latest driver package available for the Quadro.

I think it would be a good idea to find out why the CUDA errors occur in the first place. If error #4 is indeed an unspecified launch failure (you may want to print the error string to make sure), those are almost always due to bugs in user code in my experience. An example is an unchecked CUDA API status return. When it fails (e.g. out of memory, resulting in an invalid device pointer), some downstream kernel will throw an ULF.

RDAVIDRR · August 28, 2013, 8:53pm

I am using driver version 320.57, which is only a few weeks old. All my CUDA calls have a cuda_check, and I print the error string whenever an error is returned. Most likely it is a bug in my code, but I’m trying to isolate what would cause such a failure and work my way back. Could a kernel accessing device memory that is valid at kernel launch, but is freed by another thread while the kernel is still running, cause the driver failure/corruption that you are talking about??

njuffa · August 28, 2013, 9:32pm

Seems like you have the latest driver, my Windows7 machine here has a Quadro 2000 running 320.49.

Freeing memory that is still being operated on would be bad indeed; that could lead to scribbling on all kinds of other data. That applies regardless of whether one is talking of CPU or GPU code, as this could easily destroy data needed for memory management, for example.

I do not know how much “damage” the driver is designed to recover from. You may want to consider filing a bug via the registered developer website for more robust error recovery, attaching a self-contained version of your code as a repro case.

RDAVIDRR · August 28, 2013, 9:46pm

Yeah, I do not know whether that condition is actually happening; I was just checking what the “damage” would be. It sounds like I should write a kernel that deliberately replicates that condition and see whether the “damage” matches what I’m seeing in the more complex application.

RDAVIDRR · September 4, 2013, 11:05pm

Hi, I forgot to say: this is what I see in nvidia-smi after the error:

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            ERROR: GPU is lost                                          |
|    1            Not Supported                                               |
+-----------------------------------------------------------------------------+

RDAVIDRR · September 16, 2013, 9:10pm

Update: After a lot of searching for ways to reproduce the error consistently, and trying to find out why only some systems would reproduce it, I found this:

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdDetails/?spf_p.tpst=swdMain&spf_p.prp_swdMain=wsrp-navigationalState%3Didx%253D4%257CswItem%253Dwk_106353_1%257CswEnvOID%253D%257CitemLocale%253D%257CswLang%253D%257Cmode%253D4%257Caction%253DdriverDocument&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken

After checking the systems that could reproduce the problem, I confirmed that the Quadro 6000 cards in those particular systems had Hynix memory and BIOS 70.00.57.00.03, while the ones that never reproduced it had Samsung memory. After updating the BIOS to 70.00.6F.00.04 the problem was gone!!

So actually the problem was never in my code…

RDAVIDRR · September 16, 2013, 9:16pm

Is there a way to query the BIOS version through CUDA or Windows, so that I can add some code to detect this condition programmatically and tell the user to update the Quadro 6000 BIOS???

njuffa · September 17, 2013, 12:40am

I am sorry for the inconvenience but glad to hear you were able to track down the root cause and fix it. I tried looking at the link you posted but it does not seem to work for me [restricted access?]. I was not aware of such VBIOS/memory interaction. This falls under the 0.1% of cases where an ULF is not due to issues in user code.

nvidia-smi -q reports the “VBIOS Version”. Would that be sufficient for your needs?

RDAVIDRR · September 17, 2013, 5:56pm

Well, I want my code to do the check at runtime without calling system functions. Is there another way?

About the link, just google “quadro 6000 firmware update” and click on the link “HP Z820 Workstation - NVIDIA Quadro 6000 Video BIOS (ROM) and …”. There you will see the description of BIOS version 70.00.6F.00.04, and the fixes tab shows:

"The following fixes was added to the NVIDIA Q6000 Video BIOS 70.00.6F.00.04:

Fixes instability and corrupt rendering issues on Q6000s with Hynix memory."

As I was saying, the problem was only seen on the Q6000s with Hynix memory. Is there a way to know exactly how this instability was fixed (or how it was caused, for that matter), and whether I can expect my users not to hit this problem again? Should I swap those units for ones with Samsung memory? And how can I catch future problems like this?

njuffa · September 17, 2013, 7:59pm

Have a look at NVML. As far as I know, nvidia-smi is implemented on top of this library:

https://developer.nvidia.com/nvidia-management-library-nvml
“Identification: Various dynamic and static information is reported, including board serial numbers, PCI device ids, VBIOS/Inforom version numbers and product names”

I haven’t used NVML myself, so that’s about as much information as I am able to provide. I have no insights into your other questions.

RDAVIDRR · September 17, 2013, 9:14pm

NVML should do it, thanks so much for all the info!! As for the other questions, so far everything has been running flawlessly with the new BIOS (about 30x the average number of iterations that previously reproduced the error), so it should be fine.
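
A minimal C++ sketch of the VBIOS check discussed above. It assumes the version string has the five dot-separated hexadecimal fields seen in this thread (70.00.57.00.03, 70.00.6F.00.04) and that comparing those fields left to right orders releases correctly; neither assumption is confirmed anywhere in this thread. The names parseVbios and needsVbiosUpdate are hypothetical. In a real application the reported string would presumably come from NVML (for example via nvmlDeviceGetVbiosVersion) rather than from a literal.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Five dot-separated fields, each read as hexadecimal ("6F" -> 0x6F).
// Assumption: the VBIOS strings of interest all follow this layout.
using VbiosVersion = std::array<unsigned long, 5>;

// Hypothetical helper: parse e.g. "70.00.6F.00.04" into its five fields.
// Returns false when the field count is wrong or a field is not hex.
bool parseVbios(const std::string& text, VbiosVersion& out) {
    std::size_t pos = 0;
    for (std::size_t i = 0; i < out.size(); ++i) {
        const bool last = (i + 1 == out.size());
        const std::size_t dot = text.find('.', pos);
        // The last field must not be followed by a dot; all others must be.
        if (last != (dot == std::string::npos)) return false;
        const std::string field =
            text.substr(pos, last ? std::string::npos : dot - pos);
        if (field.empty()) return false;
        for (char c : field) {
            if (!std::isxdigit(static_cast<unsigned char>(c))) return false;
        }
        out[i] = std::strtoul(field.c_str(), nullptr, 16);
        if (!last) pos = dot + 1;
    }
    return true;
}

// Hypothetical helper: true when `reported` is a well-formed version that is
// older than `fixedIn`. Malformed input yields false, so callers that want to
// warn about unknown formats should call parseVbios directly.
bool needsVbiosUpdate(const std::string& reported, const std::string& fixedIn) {
    VbiosVersion have{}, want{};
    if (!parseVbios(reported, have) || !parseVbios(fixedIn, want)) return false;
    return have < want;  // std::array compares lexicographically
}
```

With the versions from this thread, needsVbiosUpdate("70.00.57.00.03", "70.00.6F.00.04") is true for the affected boards and false once a board reports 70.00.6F.00.04. Keeping the parsing separate from the NVML query means the comparison can be exercised on machines without a GPU.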
