Tesla C2050 acting flaky?

I have a C2050 that usually appears to be acting fine. My kernels run without errors, cuda-memcheck reports no errors, and the results are correct.

However, every now and then the card produces incorrect results and/or reports kernel execution times that are nearly zero, all without any errors reported by CUDA. When this occurs, it can take a while to go away.
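
One way to catch these episodes when they start is to record each run's kernel time on the host and flag any run whose time collapses toward zero. The following is only a minimal sketch, assuming the timings are already collected in milliseconds; the function name `is_suspicious` and the 10% threshold are hypothetical choices, not anything from CUDA itself.

```python
from statistics import median


def is_suspicious(elapsed_ms, history_ms, ratio=0.1):
    """Return True when elapsed_ms is far below the median of earlier runs.

    elapsed_ms: time of the run being checked (milliseconds).
    history_ms: times of earlier runs believed to be good.
    ratio:      hypothetical threshold; a run faster than ratio * median
                is treated as "nearly zero" and flagged.
    """
    if not history_ms:
        # Nothing to compare against yet, so do not flag the run.
        return False
    return elapsed_ms < ratio * median(history_ms)
```

A flagged run is a good moment to save the output and check the driver log, since the card seems to stay in the bad state for a while.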

The GPU temperature is 74 °C. This is running in an Alienware Area-51 with a 1kW PSU, i7-980x, openSUSE 11.2, and a GTX-480 for display (the computer is rated for dual GTX-480).

When this happens, the GTX-480 still seems to be working fine.

Does anyone know what might be going on? Do I have a bad Tesla?

Thanks,
Matt

I’d recommend running something as a stress test (like nbody or a DGEMM sweep) for an extended period of time. Also, try looking in the /var/log/messages file to see if there are any messages from the NVIDIA driver.
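
For the log check, messages from the NVIDIA kernel driver are tagged with `NVRM`, so filtering on that string is a reasonable first pass. This is a rough sketch, not a complete diagnostic; the helper name `nvidia_driver_lines` is made up for illustration.

```python
def nvidia_driver_lines(log_lines):
    """Return the log lines that mention the NVIDIA kernel driver.

    log_lines: any iterable of strings, for example an open file handle.
    Lines are matched on the 'NVRM' tag the driver uses in kernel messages.
    """
    return [line.rstrip("\n") for line in log_lines if "NVRM" in line]
```

To use it, pass an open handle on /var/log/messages (reading that file usually requires root) and print whatever comes back.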

What’s the PSU brand? What wattage are you pulling at full load, measured from the wall socket?

I always drop into hardware stability threads and remind people that PSUs are the most critical component for multi-GPU systems.
Also, don’t use a UPS.

I completely agree with SPWorley. Also check how much power is available on the 12V rails, how it is divided between the different 12V rails and how they are connected to GPUs and Motherboard. The availability of 1kW overall does not necessarily imply that each GPU gets enough power.
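
The per-rail arithmetic can be sketched as below. Each 12V rail delivers at most amps times 12 watts, and every device wired to that rail has to fit within it. All names and numbers here are placeholders: the real rail ratings come from the PSU label, and the real loads from the cards' specifications.

```python
def rail_watts(amps, volts=12.0):
    """Maximum power a rail can deliver, in watts."""
    return amps * volts


def rails_short_of_power(rail_amps, load_watts):
    """Return the names of rails whose load exceeds their capacity.

    rail_amps:  {rail name: rated amps}, taken from the PSU label.
    load_watts: {rail name: total watts drawn by devices on that rail}.
    A rail missing from rail_amps is treated as having no capacity.
    """
    return sorted(
        rail
        for rail, watts in load_watts.items()
        if watts > rail_watts(rail_amps.get(rail, 0.0))
    )
```

For example, a hypothetical 18 A rail supplies 216 W, so a card drawing 250 W from it alone would be flagged even though the PSU as a whole is rated at 1kW.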

Why not use a UPS? Can you please elaborate on that?

thanks

eyal

I was forced to buy Dell; actually, Alienware was the only Dell line that officially supported the C2050 power draw at that time.

I’ll have to try to round up a Kill A Watt.

Thanks,

Matt

The GPUs are connected via separate cables from the PSU, but I don’t know what the 12V rail design is. Since it is a Dell, I have no idea how to get that information. I’ll try asking Dell.

The system is rated for dual GTX-480 in SLI, but that does not rule out a bad PSU.

Note also that I am currently only using one GPU at a time (other than basic 2D display via the GTX-480).

Thanks,

Matt

Do you have ECC memory enabled?

Not at the moment. My executions are very fast (a few seconds) and I’d like the additional bandwidth.

It is very hard to test as it seems to happen every few weeks. When it does happen, the card does not behave properly until a restart.

Hmmm. In a previous career I worked with 5 mm diameter solid-state infrared detectors. It’s amazing how many cosmic ray hits you get even in a detector that size. Given the infrequency of your problem, that may be what is happening. That’s why I asked about ECC. However, if it’s requiring a reset, ECC probably isn’t enough to help.

Just a thought.

A UPS may degrade or limit the AC power supplied to your PC, especially if you’re drawing a lot of power. It doesn’t matter if the UPS claims to handle that wattage… don’t trust them!

Their failure can cause very subtle problems.

Don’t use a UPS!

Sorry, but I have to disagree. You can buy a quality UPS to provide any power you need at whatever quality power you need. All of the serious HPC servers in the world are hanging off of UPSs. How would you like a week’s work to disappear because of a power glitch?

Yep, I agree, actually. So let me amend that… don’t use a UPS unless it’s one that’s rated for HPC GPU server reliability. I don’t know enough about such high end UPSes.

My experience with the “professional” Cyberpower CP1500AVRLCD UPS (1500VA/900W) cost me a stressful week of my time, which is infinitely more valuable than a week’s worth of compute.

I can second that. I removed my UPS shortly after installing it. “Real” units are quite expensive, and hard to get approval for. Just hope I don’t get any more power spikes … Teslas are even more pricey.