Investigating Linux Kernel Crashes under 260.19.21

Oh my God! This is crazy! Thanks for posting it!

I will try it this week on my Linux box running Ubuntu 9.04 and the CUDA 3.2 driver/toolkit.

I would certainly love to hear how your experiment went.

If my “crash test” fails to crash your box, do you mind also stressing the disk a bit while running the test? Something like yes > some-file.txt in a different xterm window would probably do. I have a feeling that the crash is related to heavier disk activity.
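For example, this is roughly what I have in mind (just a sketch; the file name and path are arbitrary):

    # keep the disk busy with sustained writes while the CUDA test runs
    yes > /tmp/disk-stress.txt &
    # ... run the crash test in another terminal ...
    # then stop the writer and remove the scratch file
    kill %1
    rm /tmp/disk-stress.txt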

It goes without saying, if “successful”, such testing might cause file system corruption. I haven’t observed anything of this kind, though.

Sure, I will do it this week. We have some activities lined up for the first part of the week.

Will update you before Christmas :-)

Thinking about the above, I tried different power-cabling configurations and also found an extra plug on the motherboard that supplies some kind of optional PCIe power through the standard old drive connector, so I plugged it in. Still the same reboots.

I was looking at the “Thermal Settings” tab within the nvidia-settings utility and noticed that the reboot often happens when the GPU internal temperature goes above 80 °C and the board (presumably GPU board) temperature goes above 55 °C. At the same time, I’ve never seen the GPU fan speed go above 55%, as displayed in the same window.

Hence a couple of questions:

  • Could you please report what kind of GPU internal and GPU board temperatures you normally see when running intense GPGPU applications, e.g. an example from the CUDA SDK in a loop? The nvidia-settings utility offers an easy way of checking this (see the example queries after this list). Referencing the GPU card make and model would help.

  • If the board gets hot, what fan speed are you observing (it is also reported by nvidia-settings)?

  • What is the GPU’s thermal shutdown threshold, if there is one?
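For reference, this is roughly how I query these values from the command line (a sketch; the attribute names are the ones my 260-series nvidia-settings exposes and may differ on other driver versions):

    # GPU core temperature, board/ambient temperature and current fan speed
    nvidia-settings -q [gpu:0]/GPUCoreTemp
    nvidia-settings -q [gpu:0]/GPUAmbientTemp
    nvidia-settings -q [fan:0]/GPUCurrentFanSpeed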

Thanks!

OK, I have more unpleasant surprises to report.

I’ve bought another EVGA card, a GTX570 Superclocked (I’ve seen reports that Superclocked cards are built around better chips), in the hope that the particular sample of the EVGA GTX470 card I used in my previous tests was bad. Here’s what I see:

  • I installed my new GTX570 alone into the motherboard (i.e. I removed my old GTX470). I found the GTX570 to be LESS stable than my old GTX470: it takes even fewer iterations of the BlackScholes code from the SDK (in a way documented a few posts above; essentially the loop sketched after this list) to reboot the computer. The executable has not changed since the previous test.

  • If both cards, the GTX570 and the GTX470, are inserted into the motherboard, and one instance of the BlackScholes code runs on the GTX570 while the GTX470 stays idle, then the computer always reboots during the first invocation of the code. This means I have a fully reproducible crash, caused by invoking SDK code, on my system.

  • I cannot run two instances of my custom CUDA code on two different cards. This causes an almost instant reboot.
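For completeness, the reproduction is really just the prebuilt SDK binary run in a loop from a shell, something like this (a sketch; the SDK path is the default install location on my box and may differ on yours):

    # run the SDK BlackScholes sample repeatedly;
    # on my machine the box reboots within the first few iterations
    cd ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release
    for i in $(seq 1 100); do ./BlackScholes || break; done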

I’ve been talking to EVGA tech support (a pleasant surprise: they were working today, which is a US holiday). They are OK with replacing the motherboard under standard warranty terms, but they are fairly certain the motherboard is not to blame. They also say they only support Windows, so they cannot really tell whether I’m experiencing a software or a hardware issue. I really don’t want to reproduce the problem in Windows: that would take quite a bit of time.

Hence my question: is there a way to tell whether a system crash that is clearly caused by running official NVIDIA software is due to a software issue or a hardware issue?

Thanks!

Hello,

I found this thread on Google, and thankfully it’s still active, as I have a somewhat similar issue.

I’m using an EVGA Classified SR-2 dual-Xeon rig with a new PNY Quadro 4000 (Fermi GPU) card on Gentoo Linux. I’m not writing my own CUDA apps, but I am using the Mari 3D paint application, which does make heavy use of the GPU.

Last night I ran a particularly expensive process in Mari, which I noticed pushed my GPU temperature up to 90 °C, and my machine rebooted. The first time this happened it only killed X and dropped me back to a tty, but last night was a little more severe, as it rebooted the machine. I’m also considering contacting EVGA after reading your post, but I don’t feel confident they’ll be able to help me, since I’m also a Linux user. Thankfully Mari also runs on Windows; I’ll try that tonight and see if I can get it to reboot my machine during heavy GPU-related activity.

Before I had the EVGA Classified SR-2, I ran the same hardware and software (dual Xeon + Fermi Quadro 4000) on an Asus Z8PE-D12 server board. I also used Mari extensively on that board and never had a single crash. I’d hate to think this has anything to do with the motherboard, though, as I really don’t want to go back to the server board.

I did notice that my GPU fan speed seems to stay the same (36%) even under heavy load. I’m going to try setting Coolbits “4” in xorg.conf so that nvidia-settings will allow me to adjust the fan speed manually. Perhaps the EVGA boards simply don’t like hot GPUs?
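For anyone following along, this is the xorg.conf change I plan to make (a sketch; the Identifier is just whatever your Device section already uses):

    Section "Device"
        Identifier "Quadro4000"
        Driver     "nvidia"
        # bit 2 (value 4) of Coolbits enables manual fan control in nvidia-settings
        Option     "Coolbits" "4"
    EndSection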

I’ll post back.

-Alan

[edit] Forgot to mention, I’m using 260.19.29.

Alan,

Your information is very helpful.

Regarding the fan speed: it does fluctuate on all my boards (Fermi GTX470, Fermi GTX570) and is certainly not constant, irrespective of the “coolbits” setting. The fluctuation is not huge, though: in the range of 5-10% around its mean value of about 45%. The fan speed is higher for the GTX570 than for the GTX470, in line with what I remember reading somewhere. The temperatures of my cards under load stay within the low 60s to 72 °C and are lower than yours.

I have no experience with Quadro cards, but I would find a constant fan speed surprising, given how hot your card gets.

Looking forward to hearing more from you on this issue.

Hi Cudesnick,

I tested the BlackScholes sample on an Ubuntu 10.04, 64-bit, CUDA 3.2 environment. I am unable to reproduce your problem.

Also, I see that I get very high performance numbers on my GT200-based card.

I am not sure why the GTX470 is returning such bad numbers. The SDK gencodes for SM 1.0 as well as SM 2.0, so maybe it has something to do with occupancy?

In any case, your results and the subsequent behaviour do not sound normal. Can you just run the profiler and note the occupancy on your GTX470?
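Something along these lines should do it with the command-line profiler (from memory for CUDA 3.x, so please double-check the variable names against your toolkit docs):

    # enable the CUDA command-line profiler and ask it to record occupancy
    echo occupancy > profiler_config.txt
    export CUDA_PROFILE=1
    export CUDA_PROFILE_CONFIG=profiler_config.txt
    export CUDA_PROFILE_LOG=blackscholes_profile.log
    ./BlackScholes
    grep -i occupancy blackscholes_profile.log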

Best Regards,

Sarnath

I haven’t done enough testing to be sure of anything, but for now I think the system reboot was NVIDIA-driver related. I rolled back to 260.19.21 and have been running smoothly for a little while. However, I hadn’t seen a reboot like that in a while either, so it’s definitely too early to tell.

Sarnath,

Thanks for your test.

I cannot reproduce my older performance numbers at the moment. I’ve just shipped my GTX470 back to EVGA for an upgrade as part of their Step-Up program. The GTX570 I have doesn’t let me get to the point where the performance numbers are printed out: my box reliably reboots before then.

I’m successfully running my own CUDA-based code on my GTX570, but only in a very light fashion: at most a single instance of my executable at a time, and only when no heavy load is exerted on the machine by other software. The performance I’m observing on the GTX570 with my executable is in line with my expectations: the GTX570 runs more than twice as fast as the GT200-based card with a very old version of CUDA that I have in a different machine.

Regarding your observation that my GTX470 performance numbers are so much worse than yours: I might have negatively affected my numbers by inserting some logging into the SDK example code, as evident from the log with my performance numbers. This logging might have introduced a major delay, given how quickly the kernel returns.

By any chance do you have access to a Fermi card, on which you could try to reproduce my crash case?

Thanks!

Alan,

Thanks for the update.

How about the temperature of your card after the driver roll-back? Did it go down?

How about the fan speed? Do you see any variation as you load the card?

Thanks!

Hi Cudesnick,

Now I see why your perf numbers had dipped. Understandable in light of the logging.
I don’t have a Fermi card, but earlier you reported that the GTX280 performs solidly with CUDA 2.3.
Can’t you replace the Fermi card with the GTX280 on CUDA 3.2 and check?
Maybe it is a Fermi + driver-related issue?

A few things:

  1. Have you checked the motherboard BIOS? Is it running the latest version? (A quick way to check is sketched after this list.)
  2. Try enabling the profiler and see if that helps. The profiler brings the clocks down and can probably make the GPU a bit more stable.
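If it helps, something like this prints the installed BIOS version on Linux (a sketch; dmidecode needs root):

    # show the motherboard BIOS version and release date
    sudo dmidecode -s bios-version
    sudo dmidecode -s bios-release-date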

Best Regards,
Sarnath

Sarnath,

The BIOS is the latest version.

Trying out my GTX280 in the motherboard of the unstable computer is probably a good idea. I’ll do this later this week.

Thanks!

Not a problem.

My card temps are the same. The Quadro 4000 is just a hot card, I guess. This might be attributed to the single-slot form factor and also to the air flow in my case. I’m tempted to re-apply some better TIM (thermal paste), but I lack the proper tools currently.

I’ve been setting the GPU fan to 70% full-time to combat this (roughly the commands sketched below), and it’s working fine. A little noisy, but my temps are generally between 50 and 82 °C now.
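For reference, this is roughly what I run after enabling Coolbits (a sketch; the attribute names are what I understand this driver series to use, so double-check against nvidia-settings -q all):

    # take manual control of the fan and pin it at 70%
    nvidia-settings -a [gpu:0]/GPUFanControlState=1
    nvidia-settings -a [fan:0]/GPUCurrentFanSpeed=70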

I did notice a slight variation in my fan speed, as you mentioned. I simply hadn’t paid enough attention before.

I’m still stable, so just keeping my fingers crossed for now.

-Alan

Looks like I’ve resolved my problem.

As SPWorley suggested a while ago, it was the PSU (power supply) that was causing the issue. I tried a couple of other PSUs, and the system was very stable under all the tests that were causing the system instabilities reported earlier in this thread. I am now running two EVGA GTX570s, with two CUDA processes on each of them, and I haven’t seen a crash over the last week.

To their credit, Thermaltake immediately accepted the return of the PSU and sent me a new one even before receiving the unit I was sending back. The new Thermaltake PSU is not causing any issues.

My current theory is that the original, broken PSU was power-cycling my system on the basis of some incorrect measurement of either power consumption or temperature. I don’t know a lot about how modern power supplies operate, so this is just a theory.

I hope this information will help somebody else debug their similar problem in the future.

Thanks for all your advice and comments!

Wow! This is good to know! Congrats! And thanks for sharing this info.

Best regards,
Sarnath