I share my experience so you don’t have to repeat my frustration.
I’ve been tearing out my hair this week trying to find an issue with one of my tools, which failed only after about 10 minutes of compute. The error was a simple Unknown Kernel Failure, which usually implies a memory issue (like writing beyond bounds). The typical debugging started with reproducing it (not hard, it failed with many data sets, though only with big ones, never small ones). Tests passed memorycheck and Ocelot. Every debug idea led to a dead end… I’m sure you know the feeling.
One clue was that a single GTX480 GPU was OK; only the dual-GPU setup failed. Hardware was fine… I swapped out GPUs and got the same issue. Heat was fine: temps were at 85 degrees (normal), and an extra test with a box fan pointed at the open PC lowered temps to about 72, but the problem still occurred.
Wall wattage was about 650 watts. This was with an 850 watt PSU (Corsair Professional). The PC was plugged into a heavy-duty UPS, a Cyberpower 1500VA (good for 900 watts sustained). Prime95 runs worked fine, so there seemed to be no CPU issue. GTX295s use the same power as the GTX480s and had the same random failures, though at a different point (probably because there were 4 slower GPUs instead of 2 fast ones).
Other code did not fail, so I kept hammering at the problem with debug logs, with problem partitioning, with Ocelot, with emulator traces, with GDB breakpoints. No progress. I was in the swamp. I was lost. For several days. We all know that feeling.
Last night, I found the problem. Or rather, the problem revealed itself to me. After yet another failed test, I shut off the PC and was surfing the web on my laptop. Five minutes later (!) I hear a physical “THUNK”… it sounded like a screwdriver fell off my desk or something. I see nothing and keep surfing. Then about a minute later, I smell magic electronics smoke… that sickly plastic ozone odor. Uh-oh…
It was the UPS. With no load on it (!) it had fried itself.
This morning, with the PC plugged straight into the wall, the “CUDA bug” was gone. Obviously the UPS was not delivering the current needed for the system at full load (even though that load was under the rated specs for both the UPS and the PSU). I double and triple checked the UPS specs (900 watts sustained), and my use of less than 700 wall watts was well under this.
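For anyone who wants to run the same sanity check on their own box, the headroom arithmetic with the numbers above looks roughly like this. It is only a minimal sketch in Python: the function names are made up for illustration, and the 0.85 PSU efficiency is an assumed typical figure, not a measured one.

```python
# Headroom check using the wattage figures quoted in this post.
# ASSUMPTION: PSU_EFFICIENCY is a guessed typical value, not a measurement.

UPS_RATED_W = 900       # UPS sustained rating
PSU_RATED_W = 850       # PSU rating (applies to its DC output side)
WALL_TYPICAL_W = 650    # wall draw shown on the UPS display
WALL_CEILING_W = 700    # upper bound on observed wall draw
PSU_EFFICIENCY = 0.85   # assumed AC-to-DC conversion efficiency


def load_fraction(draw_watts, rated_watts):
    """Return draw as a fraction of the rated capacity."""
    if rated_watts <= 0:
        raise ValueError("rated_watts must be positive")
    return draw_watts / rated_watts


def estimated_dc_load(wall_watts, efficiency):
    """Estimate the DC load on the PSU from the measured wall draw."""
    return wall_watts * efficiency


if __name__ == "__main__":
    ups_typical = load_fraction(WALL_TYPICAL_W, UPS_RATED_W)
    ups_ceiling = load_fraction(WALL_CEILING_W, UPS_RATED_W)
    psu_load = load_fraction(
        estimated_dc_load(WALL_TYPICAL_W, PSU_EFFICIENCY), PSU_RATED_W
    )
    print(f"UPS load at typical draw: {ups_typical:.0%}")
    print(f"UPS load at ceiling draw: {ups_ceiling:.0%}")
    print(f"Estimated PSU load:       {psu_load:.0%}")
```

Every fraction comes out below 1.0, which is exactly the trap: on paper there was headroom on both the UPS and the PSU, and the failures happened anyway.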
So… I post my story here to leave a mark in your own memories that limited wattage can cause subtle CUDA bugs! And you won’t have ANY clues about it… you don’t get any alarms or signals from your PC when the voltages drop… just crashing. I am VERY lucky my UPS failed… I had no clue it was the cause, even after thinking about it when trying to diagnose the problem. (The PC I was using was not my main machine… the 850 watt PSU is in theory too small for 2 GTX480s, but that’s also why I had carefully monitored its actual power use and was confident the PSU was OK.)
I really liked the UPS… I chose it because of its hefty rating, plus it has a built-in real-time wattage display (which is how I knew the power use, though I also have a Kill A Watt). I’ll move my other compute PCs off of UPSes as well. My conclusion is that you can’t trust their ratings, especially for sustained use, which is obviously stressful.