Hardware failure following invalid memory access: an expensive problem...

Hi all, hope I put this post in the right place. I’m relatively new to CUDA and let’s just say things aren’t going too well…

The story: I finally started porting my project from GPGPU/shaders to CUDA. I’m unwilling to go into much detail on a public forum, but let’s just say it involves fluid dynamics. I hadn’t got far with my port (okay, actually a complete rewrite) when I finished writing a new kernel and ran the program to test it out. Whoops, the screen started showing purple lines and flickering, indicating I’d got some pointer arithmetic wrong when accessing data within CUDA’s linear memory space - how embarrassing :"> . Fair enough, I didn’t expect to get things right first time, and a restart solved the problem.

I eventually found the erroneous line - apparently a memory read from a pitched array. Unfortunately, to find it I re-ran the program with the line enabled. The same thing happened, but this time a restart didn’t help - Windows got as far as the login screen and started displaying artefacts before locking up. I tried turning the computer off and leaving it off for a while; same problem. To make matters worse, all the shutdowns took their toll on my hard drive, and Windows will no longer boot, complaining of corrupt files.

It then occurred to me that this could have been the problem all along, but I got the same thing under Linux, which boots from a separate hard drive (and is now running fine with my old graphics card in). The X server would get as far as the login page and display some exciting visual artefacts before completely seizing up, forcing me to turn the whole machine off again. :wacko:
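For context, the sort of slip I mean is indexing a pitched allocation as if it were tightly packed. This isn’t my actual code, just a minimal sketch of the pattern, assuming a simple float array allocated with cudaMallocPitch:

// Not my actual kernel, just a sketch of the mistake I mean.
// cudaMallocPitch returns a pitch in bytes that can be larger than
// width * sizeof(float), so rows must be stepped by pitch, not width.
//
// Host side, for context:
//   float* devPtr; size_t pitch;
//   cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);

__global__ void readPitched(const float* data, size_t pitch,
                            int width, int height, float* out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Wrong: assumes rows are packed, so reads run past the end of each row
    // float v = data[y * width + x];

    // Right: offset by y * pitch bytes, then index within the row
    const float* row = (const float*)((const char*)data + y * pitch);
    float v = row[x];

    out[y * width + x] = v;
}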

My hardware is a GeForce 9800GX2, which I use in non-SLI mode to expose both CUDA devices; my program uses both for separate tasks. Booting Linux did reveal something: the artefacts only occurred on one of my two monitors, implying the issue is localised to the card that was presumably running the problem kernel (the 9800GX2 is essentially two cards in one package, each providing a single monitor output).
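In case it helps anyone picture the setup: with SLI off the card simply enumerates as two independent CUDA devices. A hypothetical snippet like this lists them (not part of my project, just illustration):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);              // the 9800GX2 reports 2 here
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, %lu MB\n", dev, prop.name,
               (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
    }
    return 0;
}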

So, does anyone have any idea what happened? I assume one of the cards is fried, and I’m guessing it’s probably the memory, but how did CUDA cause it? And more importantly, before I reach for my wallet and get a pair of 285s, is it likely to happen again? The only thing I can think of is that the card was on the way out anyway and a bit of use just pushed it over the edge (it’s seen some reasonable use over the past year). Has anyone else experienced problems when accidentally reading from an incorrect memory location? (I’m sure it’s not just me who occasionally gets this sort of thing wrong.)

My other concern is that I’m supposed to be evaluating CUDA for the curriculum of a new course here at university and I now have visions of a lab full of students all breaking their machines… :blink:

Thanks

Feel free to post a card-killer executable. I am not afraid to run it, because I don’t think that CUDA software can truly break hardware. ;)

Christian

Yeah, I crash machines running CUDA all the time without doing permanent hardware damage. I did destroy a GTX280 after about a year of relatively high usage, but it wasn’t tied to any specific kernel. That was 1 out of about 10 cards I’ve used over the last 3 years, so my personal failure rate has been pretty good. (And definitely go with a vendor that has a good RMA procedure so that you can get broken cards replaced.)

I think you were just unlucky and had a hardware failure while running CUDA.

I should say, I’ve wedged a machine hard enough to require a full power cycle (rather than just punching the reset switch) to get the graphics cards happy again.