I’m seeing some strange behavior and wonder if it happens to others. I have a Supermicro 5038A with a 900W power supply and a Titan Black GPU. I can run the samples just fine, but when I run my own program via PyCUDA, maybe 50% of the time the server will suddenly die and then reboot. It only seems to happen when the computation is in fp64 mode (I have it enabled via nvidia-settings). It doesn’t take prolonged computations for the behavior to occur. It can happen immediately with a simple test problem (matrix-matrix multiplication, for example). Any idea what is going on?
Sudden reboots when the GPU is under heavy load are typically a sign of an insufficiently sized power supply, although with a 900W power supply and a single Titan Black that seems unlikely. CPUs and GPUs can both experience sudden power spikes at the start of an application. That’s why there need to be reserves in the power supply. In my experience, rock-solid operation requires that the total nominal wattage of all system components is less than, or equal to, 60% of the nominal PSU wattage. So in your case, that rule of thumb would allow for 540W worth of system components.
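As a rough illustration of that rule of thumb, here is a minimal Python sketch. The function names and the per-component wattages are placeholders I made up for the example, not measurements from this machine; substitute the nominal figures from your own spec sheets.

```python
def component_budget_watts(psu_watts, headroom_percent=60):
    """Nominal component wattage allowed under the 60% rule of thumb."""
    return psu_watts * headroom_percent / 100


def fits_budget(psu_watts, component_watts, headroom_percent=60):
    """True if the summed nominal wattages stay within the budget."""
    total = sum(component_watts.values())
    return total <= component_budget_watts(psu_watts, headroom_percent)


# Placeholder nominal wattages, for illustration only (assumptions).
example_components = {"cpu": 130, "gpu": 250, "memory_storage_fans": 60}

if __name__ == "__main__":
    print(component_budget_watts(900))            # budget for a 900W PSU
    print(fits_budget(900, example_components))   # does the example fit?
```

A system can pass this check and still be unstable if a single rail or cable is overloaded, since the check only looks at the PSU's total rating.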
What are the other components in this machine: CPU(s), system memory (speed, size), mass storage? How old is the machine and the power supply specifically? Does it have an 80PLUS rating (e.g. Gold, Platinum) of any kind? Various components of PSUs deteriorate with age (e.g. capacitors), which can reduce performance and reliability, especially when they run hot for extended time periods.
Another possibility is that a particular portion of the power supply is being overloaded. Check the auxiliary PCIe power cables supplying the GPU: (1) Are there any converters, in particular six-pin to eight-pin PCIe power, or Molex-to-PCIe power? (2) Are any Y-splitters involved? (3) Is daisy-chaining in use on any of the power cables that also supply the GPU?
A PCIe auxiliary power cable with a 6-pin connector is specified to supply up to 75W, an auxiliary power cable with an 8-pin connector is rated for 150W. The Titan Black should have one 6-pin socket and one 8-pin socket, with the balance of the power supplied via the PCIe slot (which can deliver up to 75W, but with most NVIDIA GPUs never draws more than 40W or so).
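To make the arithmetic concrete, here is a small Python sketch that adds up the limits quoted above. The 250W board power is my assumption for the Titan Black's nominal rating, and the 40W slot draw is just the rough figure mentioned above; the names are made up for the example.

```python
# Per-connector limits quoted above, in watts.
PCIE_LIMIT_WATTS = {"slot": 75, "6-pin": 75, "8-pin": 150}


def spec_limit_watts(connectors):
    """Sum of the specified limits for the given power inputs."""
    return sum(PCIE_LIMIT_WATTS[name] for name in connectors)


def aux_cable_load_watts(board_power_watts, slot_draw_watts):
    """Power the auxiliary cables must carry after the slot's share."""
    return max(board_power_watts - slot_draw_watts, 0)


TITAN_BLACK_INPUTS = ["slot", "6-pin", "8-pin"]
ASSUMED_BOARD_POWER_WATTS = 250  # assumption: nominal board power
ASSUMED_SLOT_DRAW_WATTS = 40     # rough figure, "40W or so"

if __name__ == "__main__":
    print(spec_limit_watts(TITAN_BLACK_INPUTS))    # all three inputs
    print(spec_limit_watts(["6-pin", "8-pin"]))    # auxiliary cables only
    print(aux_cable_load_watts(ASSUMED_BOARD_POWER_WATTS,
                               ASSUMED_SLOT_DRAW_WATTS))
```

Under these assumptions the two auxiliary cables have to carry most of the board power between them, so anything that weakens either cable eats directly into the margin.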
Thanks for your reply! The system has a six-core Xeon E5-1650, 64 GB of RAM, a 500 GB SSD for boot/storage, and a GTX 750 Ti to run the monitor. It’s probably 5-6 years old at this point, but certainly newer than the Titan Black. The 6-pin auxiliary PCIe power cable is directly from the PSU. The PSU didn’t have an 8-pin connector, so I’m using an LP4 to 8-pin PCI Express video card power cable adapter that has two LP4 connectors, one of which was connected directly to the PSU, the other daisy-chained to the DVD reader (which is never used). I changed the cabling so both LP4 connectors were direct to the PSU (separate cables) and it didn’t make any difference. The PSU has an 80 Plus Gold sticker.
This is perplexing. With DP enabled via nvidia-settings, I can run nbody -fp64 all day long, and CUDA-Z with the heavy load box checked doesn’t even raise the fan RPM on the Titan Black while indicating 1.8 TFLOP fp64 performance. The test case that crashes the system is just a single matrix-matrix multiplication that doesn’t at all tax the CPU, RAM, SSD, or 750 Ti. I put the test case in a loop and it crashes after one or two iterations.
FYI, the system is available via Newegg: SUPERMICRO SuperWorkstation SYS-5038A-I Mid-Tower Server Barebone - Newegg.com
Edit: I was able to get it to fail with nbody by using n=60000 and the -fp64 flag. It fails immediately, but works OK for n=55000 and less. As an experiment, I tried running the 8-pin connector from a different power supply (an old PC power supply from the ’80s) and it wasn’t pretty. The GPU fan immediately pegged at full RPM and the computation hung. I unplugged the system from the wall for fear of damaging something. Seems OK now except for the original problem.
This is the likely culprit. I’d say 99% probability. A 4-pin Molex connector is designed to deliver about 50W as I recall. The 8-pin PCIe connector is supposed to deliver up to 150W, and your jury-rigged substitute can’t handle that much, i.e. it likely overloads a portion of the PSU. This causes voltage to drop, which then triggers a !PWRGOOD signal (or whatever the modern equivalent is) back to the motherboard and boom, the system performs a hard reset.
Actually, having watched the power consumption of matrix-matrix multiplication specifically, I know it causes the GPU to draw a lot of power. The important thing here is that when the matrix-multiplying app starts up, it causes a sudden increase in GPU power draw. Think of it as almost a step function, with a near-instantaneous increase of 200W or more. This is exactly the kind of scenario that taxes a PSU the most, more than steady-state heavy power draw. The issue is not a thermal one, but an electrical one.
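If you want to look for that step yourself, one option is to poll nvidia-smi while the workload starts. The following Python sketch assumes nvidia-smi is on the PATH and that your driver reports power.draw for this GPU; the helper names are made up for the example. Keep in mind that the reported value may be smoothed and polling is slow, so the real transient is likely sharper than anything this shows.

```python
import subprocess
import time


def parse_power_draw(text):
    """Parse nvidia-smi CSV output into floats; None for unreadable lines."""
    readings = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            readings.append(float(line))
        except ValueError:  # e.g. "[N/A]" when the sensor is unsupported
            readings.append(None)
    return readings


def largest_step(samples):
    """Largest increase in watts between consecutive valid samples."""
    valid = [s for s in samples if s is not None]
    steps = [b - a for a, b in zip(valid, valid[1:])]
    return max(steps, default=0.0)


def sample_power(gpu_index=0, count=50, interval_s=0.1):
    """Poll one GPU's power draw `count` times via nvidia-smi."""
    cmd = ["nvidia-smi", "-i", str(gpu_index),
           "--query-gpu=power.draw", "--format=csv,noheader,nounits"]
    samples = []
    for _ in range(count):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        samples.extend(parse_power_draw(out.stdout))
        time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    try:
        print("largest step (W):", largest_step(sample_power()))
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi not available or query failed")
```

Start the script in one terminal, then launch the matrix multiplication in another. On a machine that hard-resets you will not see the final samples, so print or log each reading as it arrives if you need them.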
Hmmm, can you suggest a solution? Am I out of options if my PSU doesn’t have an 8-pin connector? I’ve seen 6-pin to 8-pin adapters and my PSU has spare 6-pin cables, but I’m guessing that won’t work.
If you want rock-solid operation, you want a PSU that offers proper 8-pin and 6-pin PCIe auxiliary power connectors. No substitutes. 80PLUS Gold is OK as a minimum standard for a relatively small system. For a new purchase I would suggest 80PLUS Platinum rated (and even 80PLUS Titanium rated units for large servers), especially if electrical power is on the expensive side as it is in California, where I live. The components and build quality of those higher-rated PSUs also tend to be higher.
I have in the past used a 6-pin to 8-pin PCIe power converter cable in a particular machine with GPUs of similar power requirements as the Titan Black. That worked without problems for a number of years. I got very lucky: apparently this PSU had a ton of engineering margin built in, maybe because it shipped in an expensive system from a major brand. You may not be so lucky.
I have used pretty much every conceivable bad way of setting up GPU power supply myself in the past, so my recommendations come from a place of experience.
I’ll try a 6-pin to 8-pin converter cable first. If that doesn’t work, I’ll consider getting a new PSU. Thanks so much for your comments!
The 6-pin to 8-pin connector is marginally better. With DP enabled, the matrix multiplication doesn’t crash the system but nbody still does with n=60000. Looks like I need to consider a new PSU.