S870 causes kernel panic: deviceQuery of S870 crashes the kernel

Hello,
Has anyone been able to use the S870? I’m running RHEL 5.1 on the system that hosts the PCIe card connected to the 1U S870, and every time I run the deviceQuery binary the kernel crashes with a bunch of lines like:
nvidia…

I’ve tried allocating only < 4 GB to the kernel, in case it doesn’t behave well in 32-bit kernel mode with > 4 GB of available memory. I’ve had no problem using C870s inside the same system running RHEL 5.1, and I have the latest drivers (171.05).
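For context, the deviceQuery step amounts to little more than enumerating the devices and querying their properties. Here is a rough sketch of my own (not the actual SDK sample) showing the calls involved; as far as I can tell, opening the devices underneath these calls is what triggers the driver’s adapter initialization:

#include <stdio.h>
#include <cuda_runtime.h>

/* Minimal enumeration sketch: count the CUDA devices and query each
 * one's properties, roughly what deviceQuery does before printing its
 * report. Build with nvcc and link against the CUDA runtime. */
int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Found %d CUDA device(s)\n", count);

    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess)
            printf("  Device %d: %s, %lu bytes of global memory\n",
                   dev, prop.name, (unsigned long)prop.totalGlobalMem);
    }
    return 0;
}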

Thanks for your help!
David

We have several S870 systems attached to systems running RHEL-5 in our lab, and haven’t experienced any kernel panics.

It wouldn’t surprise me if you were getting some kind of stack overflow if you’re trying to use 4 GPUs in a 32-bit OS. Can you capture/post the full kernel panic output?

At a minimum, please generate and attach an nvidia-bug-report.log, and confirm that you’re using the latest motherboard BIOS.

Thanks!
Here’s the last screen from the crash (the previous screen(s) scrolled by, so I wasn’t able to capture them):
[] iounmap+0x9e/0xc8
[] _nv003837rm+0x3d/0x44 [nvidia]
[] _nv004713rm+0x1e1/0xf0 [nvidia]
[] _nv004720rm+0x371/0x381 [nvidia]
[] _nv004724rm+0x4f/0x1fd [nvidia]
[] _nv004427rm+0x19f/0x24e [nvidia]
[] _nv004432rm+0x47f/0x48e [nvidia]
[] _nv004465rm+0x180/0x330 [nvidia]
[] _nv004418rm+0x65/0x11f [nvidia]
[] _nv004466rm+0x30/0x3d [nvidia]
[] _nv003928rm+0x27/0x32 [nvidia]
[] _nv002952rm+0x21a/0x42d [nvidia]
[] rm_init_adapter+0x8b/0xd9 [nvidia]
[] nv_kern_open+0x31c/0x43e [nvidia]
[] chrdev_open+0x117/0x132
[] chrdev_open+0x0/0x132
[] __dentry_open+0xc7/0x1ab
[] nameidata_to_filp+0x19/0x28
[] do_filp_open+0x2b/0x31
[] sys_chown+0x3b/0x43
[] do_sys_open+0x3e/0xae
[] sys_open+0x16/0x18
[] syscall_call+0x27/0xb

Also, I’ve attached the gzipped bug report file. The BIOS is from mid-November; there has been an update since then, but only a few things were fixed (and nothing looked pertinent). I’ll try to get that installed when I can.

Also, I’ve plugged only one of the PCI cards & cables into the S870. I couldn’t figure out from the documentation whether we’re supposed to plug in both, or whether I’ll only get 2 GPUs with the one card. Before this I had 2 C870s in the chassis, and it worked fine with those.

Thank you!
nvidia_bug_report.log.gz (32.4 KB)

Hey all,
I’ve tried installing an x8 card in a Dell PowerEdge 2950 and I can’t get the 171.05 drivers to load as a kernel module, so that’s a dead end too (I’ve tried RHEL 5.0 and RHEL 5.1).
Netllama, what is the specific OS & server that you are running there at nVidia? I’ll try to mirror that config as closely as I possibly can just to get it working; I’ve been trying for a long time now and just keep hitting roadblocks.
Also, can anyone explain why I get 10 PCI bridge devices and 2 3D controllers when I run lspci, for each 16-lane controller connected to the S870?
Thanks!
David

I’m not sure what you mean by an “x8 card”; however, 171.05 is only meant to support the S870. All other hardware should be using the 169.xx drivers.

We’ve tested the S870 with RHEL-5.0-x86_64 in an HP DL140 server (using the latest BIOS) and haven’t experienced any problems. We’ve also tested with HP xw9400 and xw8600 workstations using RHEL-4.5-x86_64 and RHEL-5.0-x86_64 and didn’t experience any issues.

Have you applied the latest motherboard BIOS yet? Note that a BIOS update typically contains many fixes, and only the highlights are publicly documented, so just because you don’t see something listed that matches your problem doesn’t mean it wasn’t fixed.

If at all possible, I’d suggest trying a 64-bit Linux distribution, even if it’s just to verify whether the problem exists in that environment.

Thanks!

Bad wording on my part: by x8 I meant the 8-lane PCI Express card that connects to the S870. I have not applied the latest BIOS yet, so I’ll get to that too. We were going to go down the 64-bit road anyway, so we’ll also look into that. Thanks again!

David

I installed RHEL 5.1 x86_64 and everything went smoothly as per the instructions, and I am able to see all 4 GPUs in the S870, so that’s great!

I haven’t heard of anyone yet who has used one of these in 32-bit kernel mode. Any chance of nVidia testing it out that way? :) I know this product is in its infancy, but we were able to connect to 2 C870s in 32-bit mode, yet weren’t able to use the same number of C870s inside the S870, and we tried just about everything we could without changing driver code :)

Have you tried what is suggested in the release notes?

On some Linux releases, due to a GRUB bug in the handling of upper
memory and a default vmalloc too small on 32-bit systems, it may be
necessary to pass this information to the bootloader:

vmalloc=256MB, uppermem=524288

Example of grub conf:

title Red Hat Desktop (2.6.9-42.ELsmp)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img

Thanks - I did try that originally, and it still crashed in the same way :(

As for the 64-bit installation, I looked through my logs after the 64-bit install. Everything was well and good until a little bit later, when I got the following (it didn’t crash the kernel, but I had to reboot to be able to talk to the Teslas again):

Mar 7 02:01:02 tesla1 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 171.05 Tue Jan 22 15:58:48 PST 2008
Mar 7 02:01:28 tesla1 kernel: BUG: soft lockup detected on CPU#3!
Mar 7 02:01:28 tesla1 kernel:
Mar 7 02:01:28 tesla1 kernel: Call Trace:
Mar 7 02:01:28 tesla1 kernel: [] softlockup_tick+0xd5/0xe7
Mar 7 02:01:28 tesla1 kernel: [] update_process_times+0x42/0x68
Mar 7 02:01:28 tesla1 kernel: [] smp_local_timer_interrupt+0x23/0x47
Mar 7 02:01:28 tesla1 kernel: [] smp_apic_timer_interrupt+0x41/0x47
Mar 7 02:01:28 tesla1 kernel: [] apic_timer_interrupt+0x66/0x6c
Mar 7 02:01:28 tesla1 kernel: [] __smp_call_function+0x62/0x8b
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv003936rm+0x20/0x22
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv008191rm+0x2a/0xc7
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv001568rm+0x29/0xf8
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv001588rm+0xb7/0x11b
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv006451rm+0x87/0xd5
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv006473rm+0xde/0x125
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv006434rm+0x46/0xc3
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv006439rm+0x57/0x69
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv001754rm+0x1ae/0x22b
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv000322rm+0x23/0x28
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv001970rm+0x60/0x6d
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv001975rm+0x4b/0xe1
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv002015rm+0xb9/0xf4
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv008146rm+0x1b/0xb1
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:_nv003003rm+0xec/0x112
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:rm_disable_adapter+0x8f/0xe5
Mar 7 02:01:28 tesla1 kernel: [] :nvidia:nv_kern_close+0x21b/0x3a0
Mar 7 02:01:28 tesla1 kernel: [] __fput+0xae/0x198
Mar 7 02:01:28 tesla1 kernel: [] filp_close+0x5c/0x64
Mar 7 02:01:28 tesla1 kernel: [] put_files_struct+0x6c/0xc3
Mar 7 02:01:28 tesla1 kernel: [] do_exit+0x2d2/0x89d
Mar 7 02:01:28 tesla1 kernel: [] cpuset_exit+0x0/0x6c
Mar 7 02:01:28 tesla1 kernel: [] get_signal_to_deliver+0x42c/0x45a
Mar 7 02:01:28 tesla1 kernel: [] do_notify_resume+0x9c/0x7a9
Mar 7 02:01:28 tesla1 kernel: [] chrdev_open+0x155/0x183
Mar 7 02:01:28 tesla1 kernel: [] chrdev_open+0x0/0x183
Mar 7 02:01:28 tesla1 kernel: [] __dentry_open+0x101/0x1dc
Mar 7 02:01:28 tesla1 kernel: [] do_filp_open+0x2a/0x38
Mar 7 02:01:28 tesla1 kernel: [] sys_chown+0x45/0x56
Mar 7 02:01:28 tesla1 kernel: [] audit_syscall_exit+0x2fb/0x319
Mar 7 02:01:28 tesla1 kernel: [] int_signal+0x12/0x17

Then after a reboot, I ran ./deviceQuery and got the following, so it looks like the previous trace came from a deviceQuery too. Note that this one looks different, but after it I was able to use the Teslas (I followed it with another deviceQuery and a bandwidth test, with no errors! A rough sketch of that bandwidth check appears after the trace below.)

Mar 10 10:16:43 tesla1 kernel: BUG: soft lockup detected on CPU#4!
Mar 10 10:16:43 tesla1 kernel:
Mar 10 10:16:43 tesla1 kernel: Call Trace:
Mar 10 10:16:43 tesla1 kernel: [] softlockup_tick+0xd5/0xe7
Mar 10 10:16:43 tesla1 kernel: [] update_process_times+0x42/0x68
Mar 10 10:16:43 tesla1 kernel: [] smp_local_timer_interrupt+0x23/0x47
Mar 10 10:16:43 tesla1 kernel: [] smp_apic_timer_interrupt+0x41/0x47
Mar 10 10:16:43 tesla1 kernel: [] apic_timer_interrupt+0x66/0x6c
Mar 10 10:16:43 tesla1 kernel: [] __smp_call_function+0x62/0x8b
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv003936rm+0x20/0x22
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv008191rm+0x2a/0xc7
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv001568rm+0x29/0xf8
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv001588rm+0xda/0x11b
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv006451rm+0x87/0xd5
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv006502rm+0x160/0x1a5
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv006434rm+0x5a/0xc3
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv006439rm+0x57/0x69
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv001754rm+0x1ae/0x22b
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv000322rm+0x23/0x28
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv001970rm+0x60/0x6d
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv001975rm+0x4b/0xe1
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv002015rm+0xb9/0xf4
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv008146rm+0x1b/0xb1
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:_nv003003rm+0xec/0x112
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:rm_disable_adapter+0x8f/0xe5
Mar 10 10:16:43 tesla1 kernel: [] :nvidia:nv_kern_close+0x21b/0x3a0
Mar 10 10:16:43 tesla1 kernel: [] __fput+0xae/0x198
Mar 10 10:16:43 tesla1 kernel: [] filp_close+0x5c/0x64
Mar 10 10:16:43 tesla1 kernel: [] put_files_struct+0x6c/0xc3
Mar 10 10:16:43 tesla1 kernel: [] do_exit+0x2d2/0x89d
Mar 10 10:16:43 tesla1 kernel: [] cpuset_exit+0x0/0x6c
Mar 10 10:16:43 tesla1 kernel: [] tracesys+0xd5/0xe0
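For completeness, the bandwidth test mentioned above is essentially just a timed host-to-device copy. This is only a rough sketch of my own, not the SDK’s bandwidthTest sample, and the 64 MB transfer size is arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Time one large host-to-device cudaMemcpy with CUDA events and
 * report the effective transfer rate in MB/s. */
int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;   /* 64 MB transfer */
    void *h_buf = malloc(bytes);
    void *d_buf = NULL;

    if (!h_buf || cudaMalloc(&d_buf, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host->device: %.1f MB/s\n",
           (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}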

I am seeing this same problem on an x86_64 platform (HP DL160), with a trace that looks very similar to the one here. Was there ever a resolution to this?

Thanks,

–Joe

Does this reproduce with the CUDA_2.0 driver?

I have not yet tried the 2.0 release. I have been using 1.1. My setup worked fine for about a month and a half, but this morning when we came in, my co-worker tried to run something on the S870 and we started seeing this error. We can now reproduce it by running nvidia-smi, even across reboots. It is, at this point, unusable.

If the 2.0 release is thought to help, I will give it a shot.

If this had been working, and then suddenly stopped, then I’d suggest isolating what changed.

The only change was my adding a new user account about two weeks ago. Other than that, the only usage of the system was by non-privileged users.

I will see if I can walk backwards and figure out what started the problem.

Thanks,

–Joe

Is the nvidia-smi command installed by CUDA, or is it part of the driver package (i.e. NVIDIA-Linux-x86_64-171.06.01-pkg2.run)?

I am getting failures running nvidia-smi, so if that doesn’t touch the CUDA environment, then my problem is CUDA-independent…

Can you try to install this driver?

http://www.nvidia.com/object/linux_display…_171.06.01.html

I installed 171.06.01 yesterday.

Unfortunately, I am seeing the exact same behavior: when running nvidia-smi, the BUG shows up during the “Getting unit information…” phase of the program (the discovery and attaching steps both go fine; the errors appear in my logs during the getting-unit-info stage). All nvidia-smi tells me is “Failed to read byte,” which it repeats a few times (once for each device, I’d guess?). I’ve never seen it exit; it usually seems to hang. This morning I may run it again and let it keep going to see whether it ever actually finishes or dies. The last few days, I’ve just rebooted after 5 minutes of it not doing anything, since the system was clearly hosed.

I really can’t see anything that has changed on my system in the last month that would cause this problem. Yesterday, before I installed 171.06.01, I did a yum update, which pulled in a bunch of new packages, including a kernel update, so I rebooted before installing the driver bundle. After the reboot, I did the 171.06.01 install, rebooted again to be sure, then tried to probe, but I’m still having problems.

I let nvidia-smi run to completion this morning, and this is the result:

$ nvidia-smi

Gpus found in probe:
Found Gpuid 0x1c000
Found Gpuid 0x1a000
Found Gpuid 0x11000
Found Gpuid 0xf000
Attaching all probed Gpus…OK
Getting unit information…Failed to read byte
Failed to read byte
Failed to read byte
Failed to read byte
Failed to read byte
Failed to read byte
OK
Getting all static information…
Failed to read byte
Failed to read byte
Couldn’t get unit info!
failed!
*** glibc detected *** nvidia-smi: double free or corruption (top): 0x00000000151578f0


======= Backtrace: =========
/lib64/libc.so.6[0x395f26f4f4]
/lib64/libc.so.6(cfree+0x8c)[0x395f272b1c]
nvidia-smi[0x40ecdf]
nvidia-smi[0x40f0cf]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x395f21d8a4]
nvidia-smi(strcat+0x42)[0x40193a]
======= Memory map: ========
00400000-0041c000 r-xp 00000000 08:02 691434 /usr/bin/nvidia-smi
0051c000-00520000 rwxp 0001c000 08:02 691434 /usr/bin/nvidia-smi
15157000-15178000 rwxp 15157000 00:00 0
395e000000-395e01a000 r-xp 00000000 08:02 1142426 /lib64/ld-2.5.so
395e219000-395e21a000 r-xp 00019000 08:02 1142426 /lib64/ld-2.5.so
395e21a000-395e21b000 rwxp 0001a000 08:02 1142426 /lib64/ld-2.5.so
395f200000-395f346000 r-xp 00000000 08:02 1142428 /lib64/libc-2.5.so
395f346000-395f546000 ---p 00146000 08:02 1142428 /lib64/libc-2.5.so
395f546000-395f54a000 r-xp 00146000 08:02 1142428 /lib64/libc-2.5.so
395f54a000-395f54b000 rwxp 0014a000 08:02 1142428 /lib64/libc-2.5.so
395f54b000-395f550000 rwxp 395f54b000 00:00 0
396f400000-396f40d000 r-xp 00000000 08:02 1142587 /lib64/libgcc_s-4.1.2-20070626.so.1
396f40d000-396f60d000 ---p 0000d000 08:02 1142587 /lib64/libgcc_s-4.1.2-20070626.so.1
396f60d000-396f60e000 rwxp 0000d000 08:02 1142587 /lib64/libgcc_s-4.1.2-20070626.so.1
2aaaaaaab000-2aaaaaaac000 rwxp 2aaaaaaab000 00:00 0
2aaaaaac5000-2aaaaaac7000 rwxp 2aaaaaac5000 00:00 0
2aaaac000000-2aaaac021000 rwxp 2aaaac000000 00:00 0
2aaaac021000-2aaab0000000 ---p 2aaaac021000 00:00 0
7fff85994000-7fff859a9000 rwxp 7fff85994000 00:00 0 [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]
Aborted

nvidia-smi is part of the driver package, and is not CUDA dependent. It’s not clear from your posts whether you’ve tested the CUDA_2.0 driver yet.

I haven’t tried anything CUDA-related in the last two days, because I can’t even get the nvidia-smi probe to complete successfully.

Should I expect CUDA to act differently if nvidia-smi is failing, with kernel BUG traces in my logs and the above-mentioned types of errors in its output?
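To help separate the two, a CUDA-only check that never touches nvidia-smi could be something as small as the sketch below: it just selects each device in turn and attempts a tiny allocation. This is my own test code, not anything from the SDK, and it assumes the old (pre-4.0) runtime behavior where a thread’s context must be torn down before switching devices:

#include <stdio.h>
#include <cuda_runtime.h>

/* CUDA-only sanity check, independent of nvidia-smi: select each device
 * in turn and try a small allocation, reporting pass/fail per device. */
int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        printf("cudaGetDeviceCount failed\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        void *d_ptr = NULL;
        cudaSetDevice(dev);
        cudaError_t err = cudaMalloc(&d_ptr, 1 << 20);   /* 1 MB */
        printf("Device %d: %s\n", dev,
               err == cudaSuccess ? "OK" : cudaGetErrorString(err));
        if (err == cudaSuccess)
            cudaFree(d_ptr);
        /* Tear down this thread's context so the next cudaSetDevice call
         * is allowed to pick a different device (pre-CUDA-4.0 runtimes). */
        cudaThreadExit();
    }
    return 0;
}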