Consequences of insufficient GPU power

Dear All,

I’m wondering what the consequences are of feeding insufficient power to a GPU. We have a box with four GTX 295s, and I’ve noticed that large problems always fail (i.e. the GPU returns garbage results) after some arbitrary time on the GPU that CUDA recognizes as #7. It is always GPU 7, no matter which CPU thread is mapped to it. So I have several ideas about why it happens:

It doesn’t sound like a software bug, because the problem doesn’t depend on the CPU-GPU mapping. That leaves:
lack of power
a BIOS issue?
faulty hardware
something else?

Which one, in your opinion, is the most probable?

Thanks!

Temperature should be on your list. Can you monitor the operating temperature of your cards while running? It’s possible they are getting dangerously hot, and GPU #7 happens to be the most susceptible to bit errors.

Yep, heat is a big deal… especially with 4x295s. Check the GPU core temperatures with something like GPU-Z.
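If you would rather log temperatures from a script than watch a GUI, here is a rough Python sketch. It assumes the `nvidia-smi` tool that ships with the NVIDIA driver is on the PATH and is recent enough to support `--query-gpu`. The helper names and the 90 C default limit are illustrative assumptions, not anything from this machine.

```python
import subprocess


def parse_temperatures(csv_text):
    """Parse 'index, temperature' CSV lines into {gpu_index: temp_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def read_temperatures():
    """Ask nvidia-smi for the core temperature of every GPU it can see."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


def hot_gpus(temps, limit_c=90):
    """Return the indices of GPUs at or above limit_c (assumed threshold)."""
    return sorted(i for i, t in temps.items() if t >= limit_c)


if __name__ == "__main__":
    temps = read_temperatures()
    print(temps, "hot:", hot_gpus(temps))
```

Run it in a loop while the failing job is active and see whether GPU 7 stands out. Note that the index `nvidia-smi` reports may not match the device number CUDA assigns, so check the mapping before drawing conclusions.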

One way to diagnose this is to check whether the error only shows up after the cards have been running at full load for more than 30 seconds or so.
Alternatively try a test with the PC’s side cover off and a big box fan blowing full power at the cards, just to see if extreme cooling changes anything.

To rule out a flaky GPU, swap the order of the cards in the machine and see whether the problem moves with the card (bad card) or now affects whichever card sits in the same slot (perhaps ventilation).
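The logic of the swap test can be written down as a tiny Python sketch. The function name, the labels, and the idea of identifying cards by something like a serial number are all hypothetical, just a way to keep the bookkeeping straight.

```python
def interpret_swap(card_before, slot_before, card_after, slot_after):
    """Interpret a card-swap experiment.

    card_* : identifier of the failing card (e.g. a serial number),
             or None if no failure was reproduced after the swap.
    slot_* : identifier of the slot that the failing card sat in.
    """
    if card_after is None or slot_after is None:
        return "inconclusive: failure not reproduced after the swap"
    if card_after == card_before and slot_after != slot_before:
        # The failure followed the card to its new slot.
        return "suspect the card"
    if slot_after == slot_before and card_after != card_before:
        # A different card now fails in the same slot.
        return "suspect the slot (power, cooling, or motherboard)"
    return "inconclusive: repeat the test"


# Example: card 'A' failed in slot 3, and still fails after moving to slot 1.
print(interpret_swap("A", 3, "A", 1))
```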

You can use tools like RivaTuner to UNDERCLOCK your cards… this might help identify whether it’s a hardware issue as well.

A simple GPU memory checker: http://sourceforge.net/projects/cudagpumemtest/

And BTW, when you have joined the elite 4x295 club, you are required to talk about your PSU/case/MB choices, so we can jealously covet your awesomeness. Pay attention to seibert, he’s a member of the club and posted a great geek-porn summary of his hardware a few months back.

What PSU are you using?

The classic sign of not enough power in Windows systems is that the machine powers itself down. You don’t get a vote.

That’s not really true. You can have all sorts of voltage ripple issues that cause random behavior in hardware that draws a lot of power (see: video cards).

Also, 4x295 machines make my cube hot (and require two power supplies and a second motherboard to serve as a switch to compensate for my laziness).

I agree that there could be other issues as well…

But I know we had issues with the Ultra X3 1000W PSU not being able to power systems running single 280s…

http://www.evga.com/FORUMS/tm.aspx?m=35862

The classic sign for members who had issues was that the system would power itself off under load.
I think they trip some over-current protection built into the PSUs.

I also agree that heat could very well be a factor too. :)

I see. I didn’t build the machine, we simply bought it, so I don’t know much about the exact specs. But according to the manuals I have, it has dual Xeon 5520s on a Supermicro X8DTH-series motherboard. The GPUs are GTX 295s from BFG, and they’re liquid-cooled. The PSU is a Galaxy EVO 1250 plus a Booster X5, which feeds one of the cards.

It’s somewhat hard to unplug liquid-cooled cards, so maybe I’ll try to disable PCIe slots instead and see if the problem goes away. The core temperature is never higher than 60C up to the point when the program fails (but who knows what the real temperature at some particular transistor is). And yes, the program fails after ~15-30 min, and the fraction of wrong numbers returned by the GPU is around 1.0e-6. I’ll try to run the memory and dgemm stress tests.
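For anyone wanting to put a number on the fraction of wrong values, here is a minimal Python sketch that compares a GPU result against a trusted CPU reference. The function name and the tolerances are assumptions chosen for illustration. Pick tolerances that match the precision of your own computation.

```python
import math


def mismatch_ratio(gpu_values, reference_values, rel_tol=1e-5, abs_tol=1e-8):
    """Fraction of elements where the GPU result disagrees with the reference.

    NaN never compares close to anything, so a NaN from the GPU counts
    as a mismatch.
    """
    if len(gpu_values) != len(reference_values):
        raise ValueError("result and reference must have the same length")
    if not gpu_values:
        return 0.0
    bad = sum(
        1
        for g, r in zip(gpu_values, reference_values)
        if not math.isclose(g, r, rel_tol=rel_tol, abs_tol=abs_tol)
    )
    return bad / len(gpu_values)


# Example: one corrupted element out of 1000.
reference = [float(i) for i in range(1000)]
gpu = list(reference)
gpu[10] = 1.0e9  # simulated corrupted value
print(mismatch_ratio(gpu, reference))
```

Logging this ratio per GPU and per run makes it easy to see whether the errors stay confined to one device and whether they only start after the cards have been under load for a while.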

OK, given your setup (liquid cooled? wow!) and your observed temperatures, I would be surprised if this is a temperature problem. You don’t usually start seeing strange answers until you get near 90C or higher.

Something like memtestG80 might be a good idea as well.

GPU 7 failed the GPU memory stress test after ~15 minutes. It wasn’t a good idea to buy BFG cards for CUDA.

Sounds like you should go back to the system builder and get a replacement card. If they aren’t covering this sort of thing, I would imagine BFG would also replace it.

You bump into problems like this occasionally with consumer cards. Nominally, this is why people pay for Tesla. :)

No doubt we’ll get a replacement card. But I wouldn’t be so sure that BFG would replace it. I don’t think you can just show up and say, look, this card doesn’t multiply large matrices correctly, give me a new one. Simply because the card was made to play games.