Tesla C2050 acting flakey?

MFago · December 16, 2010, 9:55pm

I have a C2050 that usually appears to be acting fine. My kernels run without errors, cuda-memcheck reports no errors, and the results are correct.

However, every now and then the card produces incorrect results and/or reports kernel execution times that are nearly zero, all without any errors reported by CUDA. When this occurs, it can take a while to go away.

The GPU temperature is 74 °C. This is running in an Alienware Area-51 with a 1 kW PSU, i7-980X, openSUSE 11.2, and a GTX-480 for display (the computer is rated for dual GTX-480).

When this happens, the GTX-480 still seems to be working fine.

Does anyone know what might be going on? Do I have a bad Tesla?

Thanks,
Matt

plegresley · December 17, 2010, 12:37am

I’d recommend running something as a stress test (like nbody or a DGEMM sweep) for an extended period of time. Also, try looking in the /var/log/messages file to see if there are any messages from the NVIDIA driver.
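As a rough illustration of the log check suggested above, here is a minimal Python sketch that filters a syslog file for lines from the NVIDIA kernel driver. It assumes the driver tags its messages with "NVRM"; the function names `find_driver_messages` and `scan_log` are hypothetical, and reading /var/log/messages usually requires root.

```python
# Assumption: NVIDIA kernel-driver messages carry the "NVRM" tag in the syslog.
DRIVER_TAG = "NVRM"


def find_driver_messages(lines):
    """Return the log lines that mention the NVIDIA kernel driver."""
    return [line.rstrip("\n") for line in lines if DRIVER_TAG in line]


def scan_log(path="/var/log/messages"):
    """Read a syslog file and return only the driver-related lines."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, errors="replace") as handle:
        return find_driver_messages(handle)
```

Calling `scan_log()` right after a bad run and comparing the timestamps against the time of the failure may show whether the driver noticed anything, even when the CUDA runtime reported no error.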

SPWorley · December 17, 2010, 4:53am

What’s the PSU brand? What wattage are you pulling at full load, measured from the wall socket?

I always drop into hardware stability threads and remind people that PSUs are the most critical component for multi-GPU systems.
Also, don’t use a UPS.

tera · December 17, 2010, 8:53am

I completely agree with SPWorley. Also check how much power is available on the 12V rails, how it is divided between the different 12V rails and how they are connected to GPUs and Motherboard. The availability of 1kW overall does not necessarily imply that each GPU gets enough power.
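The point about per-rail limits can be sketched with simple arithmetic: each rail delivers at most 12 V times its rated current, regardless of the PSU's total wattage. The Python below is a minimal sketch under that assumption; the rail names, amp ratings, and loads in the usage note are made-up illustrative values, not the specifications of any real PSU or GPU.

```python
RAIL_VOLTAGE = 12.0  # volts, nominal


def rail_capacity_watts(amps):
    """Maximum power a 12V rail can deliver at its rated current (P = V * I)."""
    return RAIL_VOLTAGE * amps


def overloaded_rails(rail_amps, rail_loads_watts):
    """Return the names of rails whose total load exceeds their capacity.

    rail_amps: dict mapping rail name -> rated current in amps.
    rail_loads_watts: dict mapping rail name -> list of device loads in watts.
    """
    overloaded = []
    for name, amps in rail_amps.items():
        load = sum(rail_loads_watts.get(name, []))
        if load > rail_capacity_watts(amps):
            overloaded.append(name)
    return overloaded
```

For example, with two hypothetical 18 A rails (216 W each, 432 W combined), `overloaded_rails({"12V1": 18, "12V2": 18}, {"12V1": [240.0], "12V2": [150.0]})` flags "12V1" even though the combined 390 W load fits within the combined capacity.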

eyalhir74 · December 17, 2010, 11:41am

Why not use a UPS? Can you please elaborate on that?

thanks

eyal

MFago · December 17, 2010, 2:52pm

I was forced to buy Dell – actually Alienware was the only Dell that officially supported the C2050 power draw at that time.

I’ll have to try to round up a Kill A Watt.

Thanks,

Matt

MFago · December 17, 2010, 2:57pm

The GPUs are connected via separate cables from the PSU, but I don’t know what the 12V rail design is. Since it’s a Dell, I have no idea how to get that information. I’ll try asking Dell.

The system is rated for dual GTX-480 in SLI, but that does not rule out a bad PSU.

Note also that I am currently only using one GPU at a time (other than basic 2D display via the GTX-480).

Thanks,

Matt

Dittoaway · December 17, 2010, 3:30pm

Do you have ECC memory enabled?

MFago · December 17, 2010, 3:36pm

Not at the moment. My executions are very fast (a few seconds) and I’d like the additional bandwidth.

It is very hard to test as it seems to happen every few weeks. When it does happen, the card does not behave properly until a restart.
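Because the failure is rare and silent, one option is to check every run automatically rather than wait to notice it. The Python below is a minimal sketch of such a check, assuming the host code can supply the measured kernel time, the GPU result, and a trusted reference (for instance, a CPU computation on a small input). The function name `check_run` and its thresholds are hypothetical.

```python
import math


def check_run(elapsed_s, result, reference, min_elapsed_s, rel_tol=1e-6):
    """Return a list of problems found in one run (empty list means OK).

    elapsed_s: measured kernel time in seconds.
    result, reference: sequences of floats to compare element by element.
    min_elapsed_s: shortest time a healthy run could plausibly take.
    """
    problems = []
    # A near-zero time suggests the kernel never actually executed.
    if elapsed_s < min_elapsed_s:
        problems.append("kernel time suspiciously short")
    if len(result) != len(reference):
        problems.append("result length differs from reference")
    elif not all(
        math.isclose(r, e, rel_tol=rel_tol, abs_tol=1e-12)
        for r, e in zip(result, reference)
    ):
        # NaN never compares close, so corrupted NaN output is flagged too.
        problems.append("result differs from reference")
    return problems
```

Logging the returned problems with a timestamp on every run would give a record of exactly when the card went bad, which can then be lined up against driver messages or power events.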

Dittoaway · December 17, 2010, 4:59pm

Hmmm. In a previous career I worked with 5 mm diameter solid-state infrared detectors. It’s amazing how many cosmic ray hits you get even in a detector that size. Given the infrequency of your problem, that may be what is happening. That’s why I asked about ECC. However, if it’s requiring a reset, ECC probably isn’t enough to help.

Just a thought.

SPWorley · December 17, 2010, 5:44pm

A UPS may degrade or limit the AC power supplied to your PC, especially if you’re drawing a lot of power. It doesn’t matter if the UPS claims to handle that wattage… don’t trust them!

Their failure can cause very subtle problems.

Don’t use a UPS!

Dittoaway · December 17, 2010, 6:13pm

Sorry, but I have to disagree. You can buy a quality UPS to provide any power you need at whatever quality power you need. All of the serious HPC servers in the world are hanging off of UPSes. How would you like a week’s work to disappear because of a power glitch?

SPWorley · December 17, 2010, 7:05pm

Yep, I agree, actually. So let me amend that… don’t use a UPS unless it’s one that’s rated for HPC GPU server reliability. I don’t know enough about such high end UPSes.

My experience with the “professional” Cyberpower CP1500AVRLCD UPS 1500VA/900W wasted a stressful week’s worth of my time, which is infinitely more valuable than a week’s worth of compute.

MFago · December 17, 2010, 7:26pm

I can second that. I removed my UPS shortly after installing it. “Real” units are quite expensive, and hard to get approval for. Just hope I don’t get any more power spikes … Teslas are even more pricey.
