Hardware damage

I’ve been meaning to make this thread for a while - but having just damaged my 8800 as well, I think I need to bring it up.

Using CUDA w/ the Driver API - I’ve found it’s possible (and very easy) to accidentally damage the hardware.

To date I’ve damaged both a GeForce 8800 GTX and a Quadro NVS 140M, simply through out-of-bounds memory transactions, which in some cases caused the driver to crash. However, since I can’t definitively say the Quadro was damaged by CUDA, I’ll disregard it from now on.

Our 8800 GTX was absolutely damaged by CUDA, sometime around CUDA 2.1-2.3 - about the time the new drivers came out which added better support for crash recovery (e.g. instead of a BSOD for the watchdog timer or a really bad crash, the driver would simply reset).

To date the ‘damage’ has been purely superficial, corrupting only 2D drawing operations (my title bars, Windows login screen, Adobe Reader, cmd.exe, etc. all show signs of graphical corruption, though nothing major). I’ve had graphical corruption after out-of-bounds memory writes before, but it has always been cured by rebooting (a cold boot was required in some cases). I’m only reporting this case because the damage is permanent: rebooting, cold booting, even taking the card out of the motherboard for a few days doesn’t fix it.

To my knowledge, the card has only experienced out-of-bounds shared memory (smem) and global memory (gmem) transactions, both reads and writes - I’m not aware of any other operations I’ve done that have caused corruption/crashes/etc.
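For anyone following along, the usual defence against this class of bug is to check every index against the buffer size before touching memory. Below is a minimal sketch of that pattern, not the code that caused the problem described here. The helpers `in_bounds`, `guarded_store` and `run_guarded` are hypothetical names made up for illustration. The `__host__ __device__` qualifiers are defined away when not building with nvcc, so the block compiles as plain C++, and the loop in `run_guarded` is only a host-side stand-in for a kernel launch.

```cpp
#include <cstddef>
#include <vector>

// Let the same helpers build as plain C++ or under nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: true only if idx is a valid index into n elements.
__host__ __device__ inline bool in_bounds(std::size_t idx, std::size_t n) {
    return idx < n;
}

// Hypothetical helper: write value to buf[idx] only when idx is in range.
// Returns false, and writes nothing, for an out-of-bounds index.
__host__ __device__ inline bool guarded_store(float* buf, std::size_t n,
                                              std::size_t idx, float value) {
    if (!in_bounds(idx, n)) return false;
    buf[idx] = value;
    return true;
}

// Host-side stand-in for a kernel launch: each "thread" stores one element.
// Launching more threads than elements is deliberate; it mimics a grid
// whose size was rounded up to a whole number of blocks.
inline std::size_t run_guarded(std::vector<float>& data, std::size_t threads) {
    std::size_t rejected = 0;
    for (std::size_t tid = 0; tid < threads; ++tid) {
        if (!guarded_store(data.data(), data.size(), tid, 1.0f)) ++rejected;
    }
    return rejected;
}
```

With 100 elements and 128 "threads", the 28 excess stores are rejected instead of landing past the end of the buffer. In a real kernel the same check would sit at the top of the kernel body, with the element count passed in as a kernel argument.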

Firstly, I’d like to know if this is a known issue (either internally or publicly). If so, does it affect all cards, or only some architectures/models? Is there a driver fix, and are there known ‘bad’ driver versions?

And most importantly, how far can the damage go? (Clearly I don’t care so much about the 2D graphics stuff having random pixels flying all over it - but if this can affect the results of 3D rendering via OpenGL, or worse - computational results from the SPs… this is VERY BAD)

The last thing I want is to ship out a tech demo to a strategic partner, only to have it crash on some corner case - destroying their video card in the process…
(But the fact I’ve damaged one, possibly two cards in a matter of 10 months of full-time CUDA development… tells me it’s certainly a possibility.)

Edit: Okay, time flies - I’ve been working w/ CUDA for almost a year now… not 6 months.

The only way you can damage your machine from a CUDA app is the same way you can damage it by running a 3D app: the card gets hot when it’s in use, you don’t have adequate cooling, and this causes some component somewhere to fail. Out-of-bounds memory accesses, driver API, whatever: there is absolutely no series of calls or illegal operations that you can do that will damage the hardware, period.

The memory on your card has almost certainly just gone bad, which can happen over time and is one of the leading killers of consumer cards (especially when they’re run at high frequencies with less than average cooling for a long period of time). But no, CUDA didn’t kill your card.

I should note that when I said I’m absolutely sure CUDA was the cause, I only say that because the corruption started right after I rebooted from a BSOD caused by a crashing kernel (one the driver couldn’t recover from, which tends to happen after 5-6 recoveries in a row - the driver just can’t recover anymore).

It seems like one hell of a coincidence for my card to have overheated (damaging the memory) at the exact same time my kernel crashed the driver resulting in a BSOD…

As for the other card, I don’t know what killed it (I’m not certain it was CUDA though, I’m guessing bad handling - it got moved between machines a lot).

Edit: Is it possible the video card was in a somewhat corrupted state when the driver crashed - causing it to do something which increased power consumption (and thus heat)?

That said, I don’t think it was an overheating issue (cooling in this machine is better than anything I have at home, and I’ve never experienced issues at home w/ my 8600 or 260)…

I guess I’ll just have to keep an eye out for when it happens to the next card, try and get more detailed information.

That’s actually not that unusual. If you look at a lot of the gaming forums where people complain about bad cards, the story is the same: the card crashes during a game, causes a BSOD, the machine reboots, and voilà, corruption. Nothing the driver does re: power consumption or anything else will persist across a boot, so no, that’s not going to cause the problem.

Hmm, okay.

It’s not a huge issue, I must admit - it is purely superficial. All of my CUDA kernels and the OpenGL rendering we do pass our unit tests, and the CUDA kernels pass regression and accuracy/performance testing too - so it’s not like the card is seriously damaged.

I’m guessing it’s just a few fragments of the first X KB of memory which are damaged (which is probably where the framebuffer resides?) - hence the graphical corruption I’m seeing in applications with a low graphical refresh rate.

Is the implication here that non-consumer cards (i.e. Quadro or Tesla) are less likely to suffer from memory failure? I’d be interested in any technical reasons for this difference, as well as any quantification of how much less likely it might be. My experience is that a lot of people choose consumer cards for professional CUDA work, and decreased reliability might be an argument against that strategy.

The tolerances at all levels are certainly a lot higher on the professional cards than the consumer cards, that’s for sure. The testing is also significantly greater. The clocks are different, the way the chips are binned is different, etc.

Consumer cards are perfectly fine for development, but I would definitely be wary about deploying a production machine where uptime matters with a consumer card.

I believe that someone wrote a small memory testing app and posted it on the forum once…perhaps you could run that on your card and see if it can locate the bad memory.

memtestG80:

http://folding.stanford.edu/English/DownloadUtils
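
For a rough idea of what a memory tester of this kind does, here is a minimal host-side sketch of a pattern test in C++. It is not memtestG80's code and makes no claim about how that tool works internally. The function names `fill_pattern`, `count_mismatches` and `test_buffer` are hypothetical, and the four patterns are just a common illustrative choice. A GPU version would run the same write and verify passes from a kernel over device memory rather than over a host buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Write the same 32-bit pattern to every word of the buffer.
// The pointer is volatile so the compiler cannot elide the stores.
inline void fill_pattern(volatile std::uint32_t* mem, std::size_t words,
                         std::uint32_t pattern) {
    for (std::size_t i = 0; i < words; ++i) mem[i] = pattern;
}

// Read every word back and count how many differ from the expected pattern.
inline std::size_t count_mismatches(const volatile std::uint32_t* mem,
                                    std::size_t words, std::uint32_t pattern) {
    std::size_t errors = 0;
    for (std::size_t i = 0; i < words; ++i) {
        if (mem[i] != pattern) ++errors;
    }
    return errors;
}

// Run write/verify passes with complementary patterns, so that every bit
// is exercised as both 0 and 1. Returns the total number of bad words seen.
inline std::size_t test_buffer(volatile std::uint32_t* mem, std::size_t words) {
    const std::uint32_t patterns[] = {0x00000000u, 0xFFFFFFFFu,
                                      0xAAAAAAAAu, 0x55555555u};
    std::size_t errors = 0;
    for (std::uint32_t p : patterns) {
        fill_pattern(mem, words, p);
        errors += count_mismatches(mem, words, p);
    }
    return errors;
}
```

A non-zero return from `test_buffer` means at least one word did not read back as written. Recording the index of each failing word would show whether the faults cluster in one region, such as the low addresses suspected above.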