Need a program that guarantees a fatal crash to reset my card

asadafag · July 16, 2007, 9:05am

Does anyone have a program that guarantees a fatal CUDA crash? (permanent freeze, blue screen of death, etc. whatever the 1.0 driver can’t recover from)
I just found that the random crashes I mentioned in another topic don’t happen between a fatal crash and the next non-fatal crash.
Source code and winXP executable are both welcome.
Please send to hqm03ster@gmail.com or reply to this topic.

perj · July 16, 2007, 6:56pm

Just try to write a cuda version of a forkbomb, and I’m sure you’ll manage to freeze the blasted card. Since the code on the card is asynchronous to the CPU runtime, I don’t know if the computer will freeze along with the GPU. Make sure to have as many bank conflicts as possible to worsen your forkbomb!

asadafag · July 17, 2007, 3:22am

Well, I tried this :)
It indeed managed to freeze CUDA. But it didn’t reset the card :(
I guess one has to confuse the driver more to reset it…

osiris1 · July 17, 2007, 3:53am

Might as well just turn the power off as that is what you will have to do to recover from the freeze…

asadafag · July 18, 2007, 3:15am

I already tried powering off (for intervals ranging from several seconds to a day) before posting this…

wumpus · July 18, 2007, 8:37am

Writing to random memory locations is always fun, and fast too. Just adapt the parallel Mersenne Twister for maximum crashing performance!

osiris1 · July 20, 2007, 5:03am

If that is the case, I would suggest you must be referencing uninitialised memory somewhere. Go through your code with a fine-toothed comb.

asadafag · July 20, 2007, 9:26am

I did that, several times, before I started posting here. Trust me, I’m not the kind of person who asks such questions before making sure it’s not an initialization problem.
I have a memset after every alloc on both CPU and GPU, and the random behavior doesn’t change. The only thing I haven’t initialized is a PBO, which is NEVER read. And .bss should always be initialized in winXP.
I also tried running the program step-by-step to nail down the crash. But adding printf, gets or enough test code around a kernel launch stops it from crashing. >_<
Basically, things are less likely to crash if extra code slows them down enough. That makes it sound temperature-related. But it once crashed when nvcpl.dll reported 58…

eelsen · July 20, 2007, 4:33pm

If you run in emulation mode, do you still get errors of some kind? And I think that by uninitialized memory, osiris may have meant unallocated memory, i.e. you’re reading or writing past the end of the space you allocated. I’ve found that running in emulation mode with valgrind can help track down such cases.
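
Where valgrind is not available (for example on winXP), a rough way to look for the same kind of overrun in an emulation-mode or CPU-side run is to pad each allocation with guard bytes and check them afterwards. The following is only a minimal C++ sketch of that idea, not anything from the CUDA toolkit: GuardedBuffer, kGuard, kCanary, at() and intact() are hypothetical names made up for illustration, and the approach only catches writes that land within the guard region.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not a CUDA API): a byte buffer padded with guard
// regions so that writes just before or just past the payload can be detected.
struct GuardedBuffer {
    static constexpr std::size_t kGuard = 64;       // guard bytes on each side (assumed size)
    static constexpr unsigned char kCanary = 0xA5;  // fill pattern for the guards

    std::vector<unsigned char> storage;  // [guard | payload | guard]
    std::size_t size;                    // payload size in bytes

    explicit GuardedBuffer(std::size_t n)
        : storage(n + 2 * kGuard, kCanary), size(n) {
        std::memset(data(), 0, n);  // zero the payload, like a memset after every alloc
    }

    // Pointer to the first payload byte; pass this to the code under test.
    unsigned char* data() { return storage.data() + kGuard; }

    // Bounds-checked access for debug builds; asserts on an out-of-range index.
    unsigned char& at(std::size_t i) {
        assert(i < size);
        return data()[i];
    }

    // True if every guard byte still holds the canary pattern.
    bool intact() const {
        for (std::size_t i = 0; i < kGuard; ++i) {
            if (storage[i] != kCanary) return false;                  // front guard
            if (storage[kGuard + size + i] != kCanary) return false;  // back guard
        }
        return true;
    }
};
```

The idea would be to hand data() to the emulated kernel in place of the raw allocation and call intact() after each launch. A false result points at a write outside the payload, although a stray write that happens to store 0xA5, or one that lands beyond the 64 guard bytes, would go unnoticed.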

asadafag · July 21, 2007, 7:04am

That’s a useful suggestion, but it’s not likely, since I allocated everything quite conservatively… I’ll try it anyway.

My last emulation run was correct. Another one would take hours to complete, though :(

asadafag · July 21, 2007, 9:35am

The new emulation run indeed succeeded.