Hardware damage

Smokey · August 3, 2009, 1:18am

I’ve been meaning to make this thread for a while - but having just damaged my 8800 as well, I think I need to bring it up.

Using CUDA w/ the Driver API - I’ve found it’s possible (and very easy) to accidentally damage the hardware.

To date I’ve damaged both a GeForce 8800 GTX and a Quadro NVS 140M - simply through out-of-bounds memory transactions, which in some cases caused the driver to crash. However, since I can’t definitively say the Quadro was damaged by CUDA, I’ll disregard it from now on.

Our 8800 GTX was absolutely damaged by CUDA, sometime around CUDA 2.1-2.3 - around about the time the new drivers came out which added better support for crash recovery (eg: instead of BSODs for the watchdog timer, or a really bad crash - the driver would simply reset).

To date the ‘damage’ has been purely superficial, corrupting only 2D drawing operations (my title bars, windows login screen, adobe reader, cmd.exe, etc - all show signs of graphical corruption - but not overly major). I’ve always had graphical corruption after writing out of bounds memory before, however it’s always been cured by rebooting (cold boot required in some cases) - however I’m only reporting this case because it’s permanent damage - rebooting, cold booting, even taking the card out of the motherboard for a few days doesn’t fix this issue.

To my knowledge, the card has only experienced out of bounds smem and gmem transactions (reads and writes) - I’m not aware of any other operations I’ve done that have caused corruption/crashes/etc.

Firstly, I’d like to know if this is a known (either internally, or publicly) issue? If so, is it ‘all cards’, or only some architectures/models? Is there a driver fix (and known ‘bad’ driver versions?)

And most importantly, how far can the damage go? (Clearly I don’t care so much about the 2D graphics stuff having random pixels flying all over it - but if this can affect the results of 3D rendering via OpenGL, or worse - computational results from the SPs… this is VERY BAD)

The last thing I want is to ship out a tech demo to a strategic partner, only to have it crash on some corner case - destroying their video card in the process…
(But the fact I’ve damaged one, possibly 2 cards - in a matter of 10 months of fulltime CUDA development… tells me it’s certainly a possibility.)

Edit: Okay, time flies - I’ve been working w/ CUDA for almost a year now… not 6 months.

tmurray · August 3, 2009, 2:28am

The only way you can damage your machine from a CUDA app is the same way you can damage it by running a 3D app–the card gets hot when it’s in use, you don’t have adequate cooling, and this causes some component somewhere to fail. Out of bounds memory accesses, driver API, whatever–there is absolutely no series of calls or illegal operations that you can do that will damage the hardware, period.

The memory on your card has almost certainly just gone bad, which can happen over time and is one of the leading killers of consumer cards (especially when they’re run at high frequencies with less than average cooling for a long period of time). But no, CUDA didn’t kill your card.

Smokey · August 3, 2009, 2:36am

I should note when I said I’m absolutely sure CUDA was the cause, I only say that because after I got a BSOD from a crashing kernel (which the driver couldn’t recover from - tends to happen after 5-6 recoveries in a row - the driver just can’t recover anymore) - the corruption started after rebooting from the BSOD.

It seems like one hell of a coincidence for my card to have overheated (damaging the memory) at the exact same time my kernel crashed the driver resulting in a BSOD…

As for the other card, I don’t know what killed it (I’m not certain it was CUDA though, I’m guessing bad handling - it got moved between machines a lot).

Edit: Is it possible the video card was in a somewhat corrupted state when the driver crashed - causing it to do something which increased power consumption (and thus heat)?

That said, I don’t think it was an overheating issue (cooling in this machine is better than anything I have at home, and I’ve never experienced issues at home w/ my 8600 or 260)…

I guess I’ll just have to keep an eye out for when it happens to the next card, try and get more detailed information.

tmurray · August 3, 2009, 2:45am

That’s actually not that unusual. If you look at a lot of the gaming forums where people complain about bad cards, it will crash during a game, cause a BSOD, reboot, and voila corruption. Nothing the driver does re: power consumption or anything will persist across a boot, so no, that’s not going to cause the problem.

Smokey · August 3, 2009, 2:54am

Hmm, okay.

It’s not a huge issue, I must admit - it is purely superficial. All of my CUDA kernels, and the OpenGL rendering we do pass our unit tests - the CUDA kernels pass regression and accuracy/performance testing too - so it’s not like the card is seriously damaged.

I’m guessing it’s just a few fragments of the first X KB of memory which are damaged (which is probably where the framebuffer resides?) - hence the graphical corruption I’m seeing on applications with a low graphical refresh rate.

Atomiktoaster · August 3, 2009, 7:04pm

Is the implication here that non-consumer cards (i.e. Quadro or Tesla) are less likely to suffer from memory failure? I’d be interested in any technical reasons for this difference, as well as any quantification of how much less likely it might be. My experience is that a lot of people choose consumer cards for professional CUDA work, and decreased reliability might be an argument against that strategy.

tmurray · August 3, 2009, 7:25pm

The tolerances at all levels are certainly a lot higher on the professional cards than the consumer cards, that’s for sure. The testing is also significantly greater. The clocks are different, the way the chips are binned is different, etc.

Consumer cards are perfectly fine for development, but I would definitely be wary about deploying a production machine where uptime matters with a consumer card.

jack · August 4, 2009, 7:03pm

I believe that someone wrote a small memory testing app and posted it on the forum once…perhaps you could run that on your card and see if it can locate the bad memory.

seibert · August 6, 2009, 5:57pm

memtestG80:

http://folding.stanford.edu/English/DownloadUtils
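
For anyone wondering what a memory tester of this kind does in principle, here is a minimal host-side sketch in C++ of a generic write/read-back pattern test. It is not memtestG80's code, and it runs on an ordinary host buffer rather than device memory; a GPU tester would apply the same idea to global memory from a kernel. The names `pattern_test` and `find_bad_words`, and the `fault` hook used to simulate a bad cell, are hypothetical and exist only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical helper: fill the buffer with `pattern`, read it back, and
// return the indices of words that no longer hold what was written.
// `fault` is an illustrative hook called between the write and read passes
// so that a defective cell can be simulated; real hardware needs no hook.
template <typename Fault>
std::vector<std::size_t> pattern_test(std::vector<std::uint32_t>& mem,
                                      std::uint32_t pattern, Fault fault) {
    for (auto& word : mem) word = pattern;           // write pass
    fault(mem);                                      // simulated misbehaviour
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < mem.size(); ++i)     // read-back pass
        if (mem[i] != pattern) bad.push_back(i);
    return bad;
}

// Run several complementary patterns so every bit is exercised both as 0
// and as 1, then merge the failing indices without duplicates.
template <typename Fault>
std::vector<std::size_t> find_bad_words(std::vector<std::uint32_t>& mem,
                                        Fault fault) {
    std::vector<std::size_t> bad;
    for (std::uint32_t pattern :
         {0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u}) {
        for (std::size_t i : pattern_test(mem, pattern, fault))
            if (std::find(bad.begin(), bad.end(), i) == bad.end())
                bad.push_back(i);
    }
    return bad;
}
```

The reason for using more than one pattern is that a bit stuck at 0 is invisible to an all-zeros pass and only shows up once a pattern tries to set it. Calling `find_bad_words(mem, fault)` with a `fault` lambda that clears one bit of one word reports just that word's index, while a pass with only `0x00000000` reports nothing.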
