S870 causes kernel panic: deviceQuery of S870 crashes the kernel

Hello,
Has anyone been able to use the S870? I’m running RHEL 5.1 on the system that hosts the PCIe card connected to the 1U S870, and every time I run the deviceQuery binary the kernel crashes with a bunch of lines like:
nvidia…

I’ve tried allocating only < 4 GB to the kernel, in case it doesn’t behave well in 32-bit kernel mode with > 4 GB of available memory. I’ve had no problem using C870s inside the same system running RHEL 5.1, and I have the latest drivers (171.05).
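For context, the deviceQuery step amounts to little more than enumerating the devices and querying their properties. Here is a rough sketch of my own (not the actual SDK sample) showing the calls involved; as far as I can tell, opening the devices underneath these calls is what triggers the driver’s adapter initialization:

#include <stdio.h>
#include <cuda_runtime.h>

/* Minimal enumeration sketch: count the CUDA devices and query each
 * one's properties, roughly what deviceQuery does before printing its
 * report. Build with nvcc and link against the CUDA runtime. */
int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Found %d CUDA device(s)\n", count);

    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess)
            printf("  Device %d: %s, %lu bytes of global memory\n",
                   dev, prop.name, (unsigned long)prop.totalGlobalMem);
    }
    return 0;
}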

Thanks for your help!
David

We have several S870 systems attached to systems running RHEL-5 in our lab, and haven’t experienced any kernel panics.

It wouldn’t surprise me if you were getting some kind of stack overflow if you’re trying to use 4 GPUs in a 32-bit OS. Can you capture/post the full kernel panic output?

At a minimum, please generate and attach an nvidia-bug-report.log, and confirm that you’re using the latest motherboard BIOS.

Thanks!
Here’s the last screen from the crash (the previous screen(s) scrolled by, so I wasn’t able to capture them):
[] iounmap+0x9e/0xc8
[] _nv003837rm+0x3d/0x44 [nvidia]
[] _nv004713rm+0x1e1/0xf0 [nvidia]
[] _nv004720rm+0x371/0x381 [nvidia]
[] _nv004724rm+0x4f/0x1fd [nvidia]
[] _nv004427rm+0x19f/0x24e [nvidia]
[] _nv004432rm+0x47f/0x48e [nvidia]
[] _nv004465rm+0x180/0x330 [nvidia]
[] _nv004418rm+0x65/0x11f [nvidia]
[] _nv004466rm+0x30/0x3d [nvidia]
[] _nv003928rm+0x27/0x32 [nvidia]
[] _nv002952rm+0x21a/0x42d [nvidia]
[] rm_init_adapter+0x8b/0xd9 [nvidia]
[] nv_kern_open+0x31c/0x43e [nvidia]
[] chrdev_open+0x117/0x132
[] chrdev_open+0x0/0x132
[] __dentry_open+0xc7/0x1ab
[] nameidata_to_filp+0x19/0x28
[] do_filp_open+0x2b/0x31
[] sys_chown+0x3b/0x43
[] do_sys_open+0x3e/0xae
[] sys_open+0x16/0x18
[] syscall_call+0x27/0xb

Also, I’ve attached the gzipped bug report file. The BIOS is from mid-November; there has been an update since then, but only a few things were fixed (and nothing looked pertinent). I’ll try to get that installed when I can.

Also, I’ve plugged only one of the PCI cards & cables into the S870. I couldn’t figure out from the documentation whether we’re supposed to plug in both, or whether I’ll only get 2 GPUs with the one card. Before this I had 2 C870s in the chassis, and it worked fine with those.

Thank you!
nvidia_bug_report.log.gz (32.4 KB)

Hey all,
I’ve tried installing an x8 card in a Dell PowerEdge 2950 and I can’t get the 171.05 drivers to load as a kernel module, so that’s a dead end too (I’ve tried RHEL 5.0 and RHEL 5.1).
Netllama, what is the specific OS & server that you are running there at nVidia? I’ll try to mirror that config as closely as I possibly can just to get it working; I’ve been trying for a long time now and just keep hitting roadblocks.
Also, can anyone explain why I get 10 PCI bridge devices and 2 3D controllers when I run lspci, for each 16-lane controller connected to the S870?
Thanks!
David

I’m not sure what you mean by an “x8 card”; however, 171.05 is only meant to support the S870. All other hardware should be using the 169.xx drivers.

We’ve tested the S870 with RHEL-5.0-x86_64 in an HP DL140 server (using the latest BIOS) and haven’t experienced any problems. We’ve also tested with HP xw9400 and xw8600 workstations using RHEL-4.5-x86_64 and RHEL-5.0-x86_64 and didn’t experience any issues.

Have you applied the latest motherboard BIOS yet? Note that a BIOS update typically contains many fixes, and only the highlights are publicly documented, so just because you don’t see something listed that matches your problem doesn’t mean it wasn’t fixed.

If at all possible, I’d suggest trying a 64-bit Linux distribution, even if it’s just to verify whether the problem exists in that environment.

Thanks!

Bad wording on my part: by x8 I meant the 8-lane PCI Express card that connects to the S870. I have not applied the latest BIOS yet, so I’ll get to that too. We were going to go down the 64-bit road anyway, so we’ll also look into that. Thanks again!

David

I installed RHEL 5.1 x86_64 and everything went smoothly as per the instructions, and I am able to see all 4 GPUs in the S870, so that’s great!

I haven’t heard of anyone yet who has used one of these in 32-bit kernel mode. Any chance of nVidia testing it out that way? :) I know this product is in its infancy, but we were able to connect to 2 C870s in 32-bit mode, yet weren’t able to use the same number of C870s inside the S870, and we tried just about everything we could without changing driver code :)

Have you tried what is suggested in the release notes?

On some Linux releases, due to a GRUB bug in the handling of upper
memory and a default vmalloc too small on 32-bit systems, it may be
necessary to pass this information to the bootloader:

vmalloc=256MB, uppermem=524288

Example of grub conf:

title Red Hat Desktop (2.6.9-42.ELsmp)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img

Thanks - I did try that originally, and it still crashed in the same way :(

As for the 64-bit installation, I looked through my logs after the 64-bit install. Everything was well and good until a little bit later, when I got the following (it didn’t crash the kernel, but I had to reboot to be able to talk to the Teslas again):

Mar 7 02:01:02 tesla1 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 171.05 Tue Jan 22 15:58:48 PST 2008
Mar 7 02:01:28 tesla1 kernel: BUG: soft lockup detected on CPU#3!
Mar 7 02:01:28 tesla1 kernel:
Mar 7 02:01:28 tesla1 kernel: Call Trace:
Mar 7 02:01:28 tesla1 kernel: [] softlockup_tick+0xd5/0xe7
Mar 7 02:01:28 tesla1 kernel: [] update_process_times+0x42/0x68
Mar 7 02:01:28 tesla1 kernel: [] smp_local_timer_interrupt+0x23/0x47
Mar 7 02:01:28 tesla1 kernel: [] smp_apic_timer_interrupt+0x41/0x47
Mar 7 02:01:28 tesla1 kernel: [] apic_timer_interrupt+0x66/0x6c
Mar 7 02:01:28 tesla1 kernel: [] __smp_call_function+0x62/0x8b
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv003936rm+0x20/0x22
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv008191rm+0x2a/0xc7
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv001568rm+0x29/0xf8
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv001588rm+0xb7/0x11b
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv006451rm+0x87/0xd5
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv006473rm+0xde/0x125
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv006434rm+0x46/0xc3
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv006439rm+0x57/0x69
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv001754rm+0x1ae/0x22b
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv000322rm+0x23/0x28
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv001970rm+0x60/0x6d
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv001975rm+0x4b/0xe1
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv002015rm+0xb9/0xf4
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv008146rm+0x1b/0xb1
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv003003rm+0xec/0x112
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:rm_disable_adapter+0x8f/0xe5
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:nv_kern_close+0x21b/0x3a0
Mar 7 02:01:28 tesla1 kernel: [] __fput+0xae/0x198
Mar 7 02:01:28 tesla1 kernel: [] filp_close+0x5c/0x64
Mar 7 02:01:28 tesla1 kernel: [] put_files_struct+0x6c/0xc3
Mar 7 02:01:28 tesla1 kernel: [] do_exit+0x2d2/0x89d
Mar 7 02:01:28 tesla1 kernel: [] cpuset_exit+0x0/0x6c
Mar 7 02:01:28 tesla1 kernel: [] get_signal_to_deliver+0x42c/0x45a
Mar 7 02:01:28 tesla1 kernel: [] do_notify_resume+0x9c/0x7a9
Mar 7 02:01:28 tesla1 kernel: [] chrdev_open+0x155/0x183
Mar 7 02:01:28 tesla1 kernel: [] chrdev_open+0x0/0x183
Mar 7 02:01:28 tesla1 kernel: [] __dentry_open+0x101/0x1dc
Mar 7 02:01:28 tesla1 kernel: [] do_filp_open+0x2a/0x38
Mar 7 02:01:28 tesla1 kernel: [] sys_chown+0x45/0x56
Mar 7 02:01:28 tesla1 kernel: [] audit_syscall_exit+0x2fb/0x319
Mar 7 02:01:28 tesla1 kernel: [] int_signal+0x12/0x17

Then after a reboot, I ran ./deviceQuery and got the following, so it looks like the previous trace came from a deviceQuery too. Note that this one looks different, but after it I was able to use the Teslas (I followed it with another deviceQuery and a bandwidth test, with no errors! A rough sketch of that bandwidth check appears after the trace below.)

Mar 10 10:16:43 tesla1 kernel: BUG: soft lockup detected on CPU#4!
Mar 10 10:16:43 tesla1 kernel:
Mar 10 10:16:43 tesla1 kernel: Call Trace:
Mar 10 10:16:43 tesla1 kernel: [] softlockup_tick+0xd5/0xe7
Mar 10 10:16:43 tesla1 kernel: [] update_process_times+0x42/0x68
Mar 10 10:16:43 tesla1 kernel: [] smp_local_timer_interrupt+0x23/0x47
Mar 10 10:16:43 tesla1 kernel: [] smp_apic_timer_interrupt+0x41/0x47
Mar 10 10:16:43 tesla1 kernel: [] apic_timer_interrupt+0x66/0x6c
Mar 10 10:16:43 tesla1 kernel: [] __smp_call_function+0x62/0x8b
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv003936rm+0x20/0x22
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv008191rm+0x2a/0xc7
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv001568rm+0x29/0xf8
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv001588rm+0xda/0x11b
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv006451rm+0x87/0xd5
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv006502rm+0x160/0x1a5
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv006434rm+0x5a/0xc3
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv006439rm+0x57/0x69
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv001754rm+0x1ae/0x22b
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv000322rm+0x23/0x28
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv001970rm+0x60/0x6d
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv001975rm+0x4b/0xe1
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv002015rm+0xb9/0xf4
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv008146rm+0x1b/0xb1
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv003003rm+0xec/0x112
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:rm_disable_adapter+0x8f/0xe5
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:nv_kern_close+0x21b/0x3a0
Mar 10 10:16:43 tesla1 kernel: [] __fput+0xae/0x198
Mar 10 10:16:43 tesla1 kernel: [] filp_close+0x5c/0x64
Mar 10 10:16:43 tesla1 kernel: [] put_files_struct+0x6c/0xc3
Mar 10 10:16:43 tesla1 kernel: [] do_exit+0x2d2/0x89d
Mar 10 10:16:43 tesla1 kernel: [] cpuset_exit+0x0/0x6c
Mar 10 10:16:43 tesla1 kernel: [] tracesys+0xd5/0xe0
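For completeness, the bandwidth test mentioned above is essentially just a timed host-to-device copy. This is only a rough sketch of my own, not the SDK’s bandwidthTest sample, and the 64 MB transfer size is arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Time one large host-to-device cudaMemcpy with CUDA events and
 * report the effective transfer rate in MB/s. */
int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;   /* 64 MB transfer */
    void *h_buf = malloc(bytes);
    void *d_buf = NULL;

    if (!h_buf || cudaMalloc(&d_buf, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host->device: %.1f MB/s\n",
           (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}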

I am seeing this same problem on an x86_64 platform (HP DL160), with a trace that looks very similar to the one here. Was there ever a resolution to this?

Thanks,

–Joe

Does this reproduce with the CUDA_2.0 driver?

I have not yet tried the 2.0 release. I have been using 1.1. My setup worked fine for about a month and a half, but this morning when we came in, my co-worker tried to run something on the S870 and we started seeing this error. We can now reproduce it by running nvidia-smi, even across reboots. It is, at this point, unusable.

If the 2.0 release is thought to help, I will give it a shot.

If this had been working, and then suddenly stopped, then I’d suggest isolating what changed.

The only change was my adding a new user account about two weeks ago. Other than that, the only usage of the system was by non-privileged users.

I will see if I can walk backwards and figure out what started the problem.

Thanks,

–Joe

Is the nvidia-smi command installed by CUDA, or is it part of the driver package (i.e. NVIDIA-Linux-x86_64-171.06.01-pkg2.run)?

I am getting failures running nvidia-smi, so if that doesn’t touch the CUDA environment, then my problem is CUDA-independent…

Can you try to install this driver?

http://www.nvidia.com/object/linux_display…_171.06.01.html

I installed 171.06.01 yesterday.

Unfortunately, I am seeing the exact same behavior: when running nvidia-smi, the BUG shows up during the “Getting unit information…” phase of the program (the discovery and attaching steps both go fine; the errors appear in my logs during the getting-unit-info stage). All nvidia-smi tells me is “Failed to read byte,” which it repeats a few times (once for each device, I’d guess?). I’ve never seen it exit; it usually seems to hang. This morning I may run it again and let it keep going to see whether it ever actually finishes or dies. The last few days, I’ve just rebooted after 5 minutes of it not doing anything, since the system was clearly hosed.

I really can’t see anything that has changed on my system in the last month that would cause this problem. Yesterday, before I installed 171.06.01, I did a yum update, which pulled in a bunch of new packages, including a kernel update, so I rebooted before installing the driver bundle. After the reboot, I did the 171.06.01 install, rebooted again to be sure, then tried to probe, but I’m still having problems.

I let nvidia-smi run to completion this morning, and this is the result:

$ nvidia-smi

Gpus found in probe:
Found Gpuid 0x1c000
Found Gpuid 0x1a000
Found Gpuid 0x11000
Found Gpuid 0xf000
Attaching all probed Gpus…OK
Getting unit information…Failed to read byte
Failed to read byte
Failed to read byte
Failed to read byte
Failed to read byte
Failed to read byte
OK
Getting all static information…
Failed to read byte
Failed to read byte
Couldn’t get unit info!
failed!
*** glibc detected *** nvidia-smi: double free or corruption (top): 0x00000000151578f0


======= Backtrace: =========
/lib64/libc.so.6[0x395f26f4f4]
/lib64/libc.so.6(cfree+0x8c)[0x395f272b1c]
nvidia-smi[0x40ecdf]
nvidia-smi[0x40f0cf]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x395f21d8a4]
nvidia-smi(strcat+0x42)[0x40193a]
======= Memory map: ========
00400000-0041c000 r-xp 00000000 08:02 691434 /usr/bin/nvidia-smi
0051c000-00520000 rwxp 0001c000 08:02 691434 /usr/bin/nvidia-smi
15157000-15178000 rwxp 15157000 00:00 0
395e000000-395e01a000 r-xp 00000000 08:02 1142426 /lib64/ld-2.5.so
395e219000-395e21a000 r-xp 00019000 08:02 1142426 /lib64/ld-2.5.so
395e21a000-395e21b000 rwxp 0001a000 08:02 1142426 /lib64/ld-2.5.so
395f200000-395f346000 r-xp 00000000 08:02 1142428 /lib64/libc-2.5.so
395f346000-395f546000 ---p 00146000 08:02 1142428 /lib64/libc-2.5.so
395f546000-395f54a000 r-xp 00146000 08:02 1142428 /lib64/libc-2.5.so
395f54a000-395f54b000 rwxp 0014a000 08:02 1142428 /lib64/libc-2.5.so
395f54b000-395f550000 rwxp 395f54b000 00:00 0
396f400000-396f40d000 r-xp 00000000 08:02 1142587 /lib64/libgcc_s-4.1.2-20070626.so.1
396f40d000-396f60d000 ---p 0000d000 08:02 1142587 /lib64/libgcc_s-4.1.2-20070626.so.1
396f60d000-396f60e000 rwxp 0000d000 08:02 1142587 /lib64/libgcc_s-4.1.2-20070626.so.1
2aaaaaaab000-2aaaaaaac000 rwxp 2aaaaaaab000 00:00 0
2aaaaaac5000-2aaaaaac7000 rwxp 2aaaaaac5000 00:00 0
2aaaac000000-2aaaac021000 rwxp 2aaaac000000 00:00 0
2aaaac021000-2aaab0000000 ---p 2aaaac021000 00:00 0
7fff85994000-7fff859a9000 rwxp 7fff85994000 00:00 0 [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]
Aborted

nvidia-smi is part of the driver package, and is not CUDA dependent. It’s not clear from your posts whether you’ve tested the CUDA_2.0 driver yet.

I haven’t tried anything CUDA-related in the last two days, because I can’t even get the nvidia-smi probe to complete successfully.

Should I expect CUDA to act differently if nvidia-smi is failing, with kernel BUG traces in my logs and the above-mentioned types of errors in its output?
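To help separate the two, a CUDA-only check that never touches nvidia-smi could be something as small as the sketch below: it just selects each device in turn and attempts a tiny allocation. This is my own test code, not anything from the SDK, and it assumes the old (pre-4.0) runtime behavior where a thread’s context must be torn down before switching devices:

#include <stdio.h>
#include <cuda_runtime.h>

/* CUDA-only sanity check, independent of nvidia-smi: select each device
 * in turn and try a small allocation, reporting pass/fail per device. */
int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        printf("cudaGetDeviceCount failed\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        void *d_ptr = NULL;
        cudaSetDevice(dev);
        cudaError_t err = cudaMalloc(&d_ptr, 1 << 20);   /* 1 MB */
        printf("Device %d: %s\n", dev,
               err == cudaSuccess ? "OK" : cudaGetErrorString(err));
        if (err == cudaSuccess)
            cudaFree(d_ptr);
        /* Tear down this thread's context so the next cudaSetDevice call
         * is allowed to pick a different device (pre-CUDA-4.0 runtimes). */
        cudaThreadExit();
    }
    return 0;
}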