Subtle CUDA bug finally diagnosed

I share my experience so you don’t have to repeat my frustration.

I’ve been tearing out my hair this week trying to find an issue with one of my tools, which failed only after about 10 minutes of compute. The error was a simple Unknown Kernel Failure, which usually implies a memory issue (like writing out of bounds). Debugging started with reproducing it, which was not hard: many data sets triggered it, though all of them were big and none were small. Tests passed memorycheck and Ocelot. Every debug idea led to a dead end… I’m sure you know the feeling.

One clue was that a single GTX480 GPU was OK; only the dual-GPU setup failed. Hardware was fine… I swapped out GPUs and got the same issue. Heat was fine: temps were at 85 degrees (normal), and an extra test with a box fan pointed at the open PC lowered temps to about 72, yet the problem still occurred.

Wall wattage was about 650 watts. This was with an 850 watt PSU (Corsair Professional). The PC was plugged into a heavy-duty UPS, a Cyberpower 1500VA (good for 900 watts sustained). Prime95 runs worked fine, so there seemed to be no CPU issue. GTX295s use the same power as the GTX480s and had the same random failures, though at a different point (probably because there were 4 slower GPUs instead of 2 fast ones).

Other code did not fail, so I kept hammering at the problem with debug logs, with problem partitioning, with Ocelot, with emulator traces, with GDB breakpoints. No progress. I was in the swamp. I was lost. For several days. We all know that feeling.

Last night, I found the problem. Or rather, the problem revealed itself to me. After yet another failed test, I shut off the PC and was surfing the web on my laptop. Five minutes later (!) I heard a physical “THUNK”… it sounded like a screwdriver had fallen off my desk or something. I saw nothing and kept surfing. Then about a minute later, I smelled magic electronics smoke… that sickly plastic ozone odor. Uh-oh…

It was the UPS. With no load on it (!) it had fried itself.

This morning, with the PC plugged straight into the wall, the “CUDA bug” was gone. Evidently the UPS was not delivering the current the system needed at full load (even though that load was under the rated specs for both the UPS and the PSU). I double and triple checked the UPS specs (900 watts sustained), and my use of less than 700 wall watts was well under this.
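For anyone who wants to repeat that sanity check on their own box, here is a minimal Python sketch of the headroom arithmetic. The wattage figures are the ones from this post; the PSU efficiency is purely an assumption for illustration (I never measured it), and the function name is made up.

```python
def load_fraction(load_watts: float, rated_watts: float) -> float:
    """Return the load as a fraction of the rated capacity."""
    if rated_watts <= 0:
        raise ValueError("rated_watts must be positive")
    return load_watts / rated_watts


WALL_WATTS = 650.0             # measured at the wall (from the post)
UPS_RATED_WATTS = 900.0        # UPS sustained rating (from the post)
PSU_RATED_WATTS = 850.0        # PSU rating (from the post)
ASSUMED_PSU_EFFICIENCY = 0.85  # assumption, not a measured value

# The UPS sees the full wall-side load.
ups_fraction = load_fraction(WALL_WATTS, UPS_RATED_WATTS)

# The PSU rating refers to DC output, so scale the wall load by efficiency.
psu_dc_watts = WALL_WATTS * ASSUMED_PSU_EFFICIENCY
psu_fraction = load_fraction(psu_dc_watts, PSU_RATED_WATTS)

print(f"UPS load: {ups_fraction:.0%} of rating")
print(f"PSU load: {psu_fraction:.0%} of rating (assumed efficiency)")
```

Both numbers come out comfortably below 100%, which is exactly why this check alone did not point at the UPS.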

So… I post my story here to leave a mark in your own memories that limited wattage can cause subtle CUDA bugs! And you won’t have ANY clues about it… you don’t get any alarms or signals from your PC when the voltages drop… just crashing. I am VERY lucky my UPS failed… I had no clue it was the cause, even after thinking about it when trying to diagnose the problem. (The PC I was using was not my main machine… the 850 Watt PSU is in theory too small for 2 GTX480s but that’s also why I had carefully monitored its actual power use and was confident the PSU was OK.)

I really liked the UPS… I chose it because of its hefty rating, plus it has a built-in real-time wattage display (which is how I knew the power use, though I also have a Kill A Watt). I’ll move my other compute PCs off of UPSes as well. My conclusion is you can’t trust their ratings, especially for sustained use, which is obviously stressful.

Thanks for the post. I have a Cyberpower 1000 with one 480 GTX, but I have experienced the CUDA_UNKNOWN_ERROR with Xinerama and two cards.

I only considered the UPS watt output for the battery, and I assumed the UPS could handle any reasonable wattage when passing power from the outlet.

The display giving power usage and battery life is very informative compared with other brand name UPSs.

Thank you for sharing!

If y’all don’t mind a tip from the gaming side:

We use an app called the OverClock Testing Tool (O.C.C.T.) that places a LinPack-type load onto the GPU while the same is running on the CPU in order to test out the PSU. The app itself isn’t as important as one of its monitoring functions - graphs of the sensor outputs. It also monitors the main voltage lines for fluctuations.

Again, the app itself isn’t as important as the concept and its use: As long as there are fluctuations on your lines, you’re not getting enough power and the system is trying to react to the shifting load. Since you have the test equipment, that’s where I’d focus on load tests.

In layman’s terms, fluctuations expose the area between rated sustained output and the peak output. You are most likely aware of that, but I wanted to point out that even this will affect the GeForce series of cards…
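To make the "watch the lines for fluctuations" idea concrete, here is a rough Python sketch that summarizes a log of voltage samples for one rail. Everything in it is illustrative: the function name is made up, the sample readings are invented rather than taken from any real machine, and the ±5% band is the commonly cited ATX tolerance, used here as an assumption. Motherboard sensors are also not precise instruments, so treat the result as a hint, not a measurement.

```python
from statistics import mean


def rail_report(samples, nominal, tolerance=0.05):
    """Summarize logged voltage samples for a single rail.

    `tolerance` is the allowed fractional deviation from `nominal`
    (0.05 means +/-5%, an assumed band).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    out_of_band = [v for v in samples if v < low or v > high]
    return {
        "mean": mean(samples),
        "min": min(samples),
        "max": max(samples),
        "swing": max(samples) - min(samples),  # peak-to-peak variation
        "out_of_band": len(out_of_band),       # samples outside the band
    }


# Invented 12 V rail readings logged during a load test.
samples_12v = [12.10, 12.08, 11.95, 11.62, 11.31, 11.90, 12.05]
report = rail_report(samples_12v, nominal=12.0)
print(report)
```

A large swing under load, or any samples outside the band, would be the cue to look at the power path before blaming the code.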

Wow! That’s really a tough one! Thanks for sharing!

I returned the UPS to Cyberpower today for warranty replacement; we’ll see what they say.

One interesting followup point. That UPS (and likely all others) has an audible alarm when overloaded. I knew it wasn’t overloaded, and the alarm also never sounded.
The UPS can also detect (and alarm) if input voltage drops (brownout) or spikes, but that also didn’t apply to this case since the input wall voltage was always fine… it was the UPS output failing (though exactly how, we don’t know.)

There are three types of UPSs. One is standby, which basically just watches the input power and if it fails, it switches on the battery and provides a backup in a few milliseconds.
This is the cheapest and most basic UPS.

The next type is the standard higher-quality UPS, which still passes through the line voltage, but the UPS can actually reshape and regularize it by using active analog power electronics without having to switch to the battery. This is the type my UPS is. The plausible theory is that the constant, heavy (but within spec limits!) load caused these “AC reshaping” electronics to fail. As the supplied power started to fail, the PSU and then the GPUs would get garbage power and crash the kernel; the power load would suddenly lighten, the UPS would recover, and the system would work fine again. Subtle subtle.

The final type of UPS is the expensive one… it always runs the electronics off the battery, so there’s never any interruption of the pure output… no switching time, no variation. It takes more electronics and is harshest on the battery itself. It’s used a lot for the high end “this server can’t fail!” backups.

Thanks for sharing! Sorry to hear about your ordeal but I’m glad you figured it out :)

CyberPower just returned my UPS after a very fast (2 weeks) warranty exchange. I give them serious credit for the quick and easy process. I just paid shipping one way. They returned it at their expense (and included a new battery, since I sent the UPS out without it to save the extra 8 pounds).

There was no documentation about what was fried or anything, but I assume they just swapped out the electronics regulator board.

I’d still like to use the UPS but I of course cannot trust it, so I’ll use it for non-GPU machines. I wish I had an audible low-AC-voltage alarm in the PC PSU to detect the problem in the future.
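Lacking a hardware alarm, the closest thing I can imagine is a software one. Below is a minimal Python sketch of the alarm logic only: it fires once a set number of consecutive readings fall below a threshold, so a single noisy sample does not trigger it. The readings, the 105 V threshold, and the function name are all hypothetical; getting real AC voltage readings into the PC would need a meter or UPS that exposes them, which is not covered here.

```python
def low_voltage_alarm(readings, threshold, consecutive=3):
    """Yield the index at which `consecutive` readings in a row
    have fallen below `threshold`.

    The alarm fires once per low run and re-arms only after a
    reading at or above the threshold.
    """
    run = 0
    for i, volts in enumerate(readings):
        run = run + 1 if volts < threshold else 0
        if run == consecutive:
            yield i


# Hypothetical AC voltage readings, one per polling interval.
readings = [120, 119, 104, 103, 102, 101, 118, 100, 99, 98]

for index in low_voltage_alarm(readings, threshold=105, consecutive=3):
    # "\a" rings the terminal bell on terminals that support it.
    print(f"\aLOW VOLTAGE at sample {index}: {readings[index]} V")
```

Requiring several low samples in a row is a design choice to avoid false alarms; a real power sag that crashes a kernel may be shorter than the polling interval, so this could still miss it.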