NVidia kernel module trouble

Hi all,

I am having big stability problems with the kernel module. I have a Dell XPS 720 H2C with a quad-core processor and 2x 8800 GTX cards. When I boot with the maxcpus=1 option, the computer runs happily, but as soon as I use all cores, running matlab for more than 15 minutes results in spontaneous reboots. I am using Ubuntu 7.10 x86_64.

  • Is anybody else experiencing this kind of problem? I will install Fedora Core now, as I have read that people are having success with that distro.

  • Is it possible to use the open-source nv module on one 8800GTX for X Windows and the binary module on the second 8800GTX for CUDA?

I am really sorry to say so, but advising users to only use 1 processor when using the binary NVIDIA module is not really encouraging people to see NVIDIA as a HPC-supplier…

I hope someone has some tips for me to make it all work reliably. And I also hope that NVIDIA will come with an open-source CUDA driver :)

Any chance your power supply is not quite enough to support all your hardware? There will be a good spike in demand when you start doing heavy processing.

I too have had a few problems with kernels locking up (the 5s limitation does not always seem to save me) and once or twice the machine rebooting. I’m not sweating it much though… it is annoying but this is a new technology at the v1.0 level. The software issues will all get worked out in time.

It seems to be a power supply problem.
I have run VMD with 4 cards for days on Linux (RHEL4 up5) and never seen a reboot or lock up.

DenisR,
Did you purchase the system with this configuration?
Have you verified that you’re using the latest BIOS?
Are you able to set up a serial console?
Why do you believe this to be an nvidia kernel module problem?

Also, please generate and attach an nvidia-bug-report.log.

thanks,
Lonni

We’re using a Quad core Intel with 2 8800 GTX with no problems at all (Debian distr). X is running on 1 card, which can be used for calculations at the same time.
The server does have redundant and high end power supplies.

Okay, I will try to comment on all your questions.

The PC was bought as is. It’s a standard Dell XPS 720 H2C. The power supply is 1 kW, so I don’t suspect that is the problem. It has 2 SATA drives and a DVD RW drive, so no huge amounts of other hardware requiring power.

As to why I think it is a kernel module problem:

  • I followed the recipe at the nvzone forum regarding instabilities to boot with pci=nommconf, idle=poll and maxcpus=1. Then the system is running smoothly. No reboots happening at all. I see that this might still leave open the option that the power supply is not OK, since 3 cores will not consume power then.
  • To get reboots I do not have to do any calculations on the 8800s. I just run standard matlab processing on the CPUs.
  • When using the open-source driver I do not get any reboots when doing the same standard matlab processing on the CPUs.
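
For anyone repeating this test: it is easy to think a boot option is active when the bootloader entry was not actually changed. Below is a minimal sketch, not an official tool, for checking which of the parameters named above the running kernel really received. It assumes a Linux system where `/proc/cmdline` is readable; the names `WATCHED`, `parse_cmdline` and `report` are made up for this illustration.

```python
from pathlib import Path

# Parameters mentioned in this thread; adjust to whatever you are testing.
WATCHED = ("maxcpus", "idle", "pci")


def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to None. Quoted values containing spaces
    are not handled; this is only a rough check.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def report(text, watched=WATCHED):
    """Return one line per watched parameter, saying whether it is set."""
    params = parse_cmdline(text)
    lines = []
    for name in watched:
        if name in params:
            lines.append(f"{name}={params[name]}")
        else:
            lines.append(f"{name} not set")
    return lines


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    path = Path("/proc/cmdline")
    if path.exists():
        print("\n".join(report(path.read_text())))
```

Running it after each reboot shows whether the kernel was really started with `maxcpus=1` or `idle=poll` before any conclusion is drawn from a stable (or unstable) run.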

I will check the BIOS version and also how to generate an nvidia-bug-report.log on Friday. I will not be near the machine until then, and I will have to reinstall Linux (I messed it up a little, it’s been too long since I used Linux a lot…).

I will then also take one 8800 out of the machine. If it is stable then, but not after I put the card back in, I will complain to Dell about the PSU.

I have one other question. On the CUDA page the linked driver is version 11; I now see a version 19 and a beta 23. Do these also support CUDA?

Thanks for all your pointers, I am starting to have faith again :)
Dènis

Only 100.14.11 supports CUDA-1.0. However, if, as you stated, you’re not using CUDA with Matlab, then this should not matter.

Not using it yet :D

In my experience, when you have machines rebooting or hanging without any clear reason, your hardware is defective. It could be the GPU, but it also happens when you’re only running matlab. So if it is your GPU card, it is the one with X running.
Matlab is also very memory intensive, so it could also be that your memory chips/banks are broken.
I would strip the computer and run it with the bare essentials. If it runs fine, put everything back, 1 by 1, and check each time. If it gives problems even in the first case, replace the memory. That does not help? Ship it back ;-)

I will first let memtest86 run on Friday, as the matlab script is using approx. 2 GB of memory. When that turns out OK, I will take out the card X was running on and see if that helps. Otherwise Dell may replace my less-than-a-week-old machine…

Could it maybe be that my wall outlet doesn’t provide enough power? We are currently using a PC/monitor on almost all our outlets, so I think the total current for the group could get high. But then again, that would most likely blow a fuse, I guess…

Hmm, all those things to try when I am babysitting at home ;)

Thanks for all the help here, I will post an update when I know more.

Dènis

Well, so far so good. What did the trick for me is idle=poll on the Linux boot command line. The matlab scripts that triggered a reboot have now been running happily for half a day.

Running Fedora Core 8 with the driver from the livna site, version 100.14.19. All CUDA examples work except the ones requiring OpenGL, but they don’t work with more than 1 card.

Spoke too soon… :(
I have had memtest86 run for 3 days without errors, so I guess memory is ok.
Now I have taken out 1 8800GTX. If that runs OK, it is either the 8800GTX or the power supply (I think).
If that crashes, I swap 8800GTX cards. If that crashes too, I have to guess it is one of the CPU cores (since maxcpus=1 works all right). Maybe I can then even find out which one it is (with maxcpus=2 and maxcpus=3).
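
A possible complement to the maxcpus=2 / maxcpus=3 idea, which only limits how many cores are brought up, is to boot with all cores and pin a load to one core at a time. The sketch below does that with Python's `os.sched_setaffinity` (Linux only). The function names and the integer workload are invented for this illustration, and it is only a hint: a fault in the PSU or mainboard could still cause a reboot no matter which core is loaded, and other processes keep running on the remaining cores.

```python
import os
import time


def burn(seconds):
    """Keep the current core busy for `seconds`; return mismatches seen.

    The same small integer computation is repeated and compared against
    its first result, so a core that miscomputes shows up as a mismatch.
    """
    def work():
        total = 0
        for i in range(100_000):
            total = (total * 31 + i) % 1_000_003
        return total

    expected = work()
    mismatches = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        if work() != expected:
            mismatches += 1
    return mismatches


def load_each_core(seconds_per_core=60):
    """Pin this process to one allowed core at a time and load it."""
    original = os.sched_getaffinity(0)
    results = {}
    try:
        for core in sorted(original):
            os.sched_setaffinity(0, {core})
            # Flush so the last line survives on a console or log
            # if the machine reboots while this core is loaded.
            print(f"loading core {core} ...", flush=True)
            results[core] = burn(seconds_per_core)
    finally:
        # Restore the original affinity mask whatever happens.
        os.sched_setaffinity(0, original)
    return results
```

Calling `load_each_core(600)` would load each core for ten minutes in turn. If the machine reboots, the last "loading core" line that reached the console or a log points at the core that was busy at the time.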

Currently running on a Dell 720 as well, but under CentOS 5.0 (with current kernel from updates). There is only one GTX in the system, but running smoothly with all cores enabled. Try a different distribution perhaps? (CentOS is RHEL without the branding btw).

Yes, I’ve been thinking about installing CentOS 5, but the fact that 2 distributions give me the same trouble makes me suspect the CPU or the PSU is at fault in my case. Removing one 8800GTX did not help, at least. Now I am running a stress test with only one core to see if it is really stable in that case. And if that is true, I’ll be calling Dell.

Not necessarily. I’ve had consistent problems with running high-spec machines off the consumer-oriented Linux distributions such as Ubuntu. The kernels tend to be tuned for the wrong things (i.e., improper SMP locking mechanics, features over stability, etc.).

Okay, well then I’ll download CentOS 5 again tomorrow to give it a whirl. I have tried Ubuntu 7.10 & Fedora Core 8 on this machine.

Well, I even had spontaneous reboots when installing CentOS 5. They went away when I booted the install kernel with maxcpus=1. So I have notified Dell that my CPU wants replacing. I reinstalled FC8 and am now getting my feet wet with CUDA (using 1 CPU core).

Please let us know if the problem is fixed when you have your new hardware. I’m very interested.

I haven’t experienced spontaneous reboots, but the nForce chipset (and specifically the Mediashield fakeraid controller) is not well supported under RHEL 5 (or CentOS), which means you can encounter storage controller related lockups. The workaround is to not use RAID on that motherboard…

Well, I have learnt not to jump too soon, but Dell replaced my mainboard and CPU last Monday, and in the past week I have been happily using all 4 cores of my CPU without problems. So it seems that I had a faulty CPU/MB…

It feels good to walk on all four legs ;)