Tesla S1070 under RH 5.3: S1070 not detected correctly by T5500

Hardware:

Dell Precision T5500, Intel-based, Nvidia Quadro NVS 295 + Tesla S1070

Software:

Red Hat Enterprise 5.3

 gcc 4.1

 g++ 4.1

 latest driver from the NVIDIA CUDA Zone (downloaded last Wednesday)

 latest toolkit (same source)

 latest SDK (same source)

The downloaded driver installed without errors, following the instructions in the brochure
included inside the S1070 box. The driver that came with the box did not work: the system
crashed. We can load the driver, and it identifies the actual graphics card
(Quadro) and the Tesla HPC with the right labels. However, the lspci | grep nVidia
command gives: (!)

[codebox]03:00.0 VGA compatible controller: nVidia Corporation Unknown device 06fd (rev a1)
81:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
82:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
82:01.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
82:02.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
82:03.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
85:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
86:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
86:01.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
86:02.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
86:03.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
87:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
89:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
8c:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
8d:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
8d:01.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
8d:02.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
8d:03.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
90:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
91:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
91:01.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
91:02.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
91:03.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
92:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
94:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
[/codebox]

and dmesg gives: :angry:

[codebox]nvidia: module license 'NVIDIA' taints kernel.

GSI 23 sharing vector 0x6A and IRQ 23

ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 24 (level, low) -> IRQ 106

PCI: Setting latency timer of device 0000:03:00.0 to 64

GSI 24 sharing vector 0x72 and IRQ 24

ACPI: PCI Interrupt 0000:87:00.0[A] -> GSI 63 (level, low) -> IRQ 114

PCI: Setting latency timer of device 0000:87:00.0 to 64

GSI 25 sharing vector 0x7A and IRQ 25

ACPI: PCI Interrupt 0000:89:00.0[A] -> GSI 54 (level, low) -> IRQ 122

PCI: Setting latency timer of device 0000:89:00.0 to 64

GSI 26 sharing vector 0x82 and IRQ 26

ACPI: PCI Interrupt 0000:92:00.0[A] -> GSI 64 (level, low) -> IRQ 130

PCI: Setting latency timer of device 0000:92:00.0 to 64

GSI 27 sharing vector 0x8A and IRQ 27

ACPI: PCI Interrupt 0000:94:00.0[A] -> GSI 56 (level, low) -> IRQ 138

PCI: Setting latency timer of device 0000:94:00.0 to 64

NVRM: loading NVIDIA UNIX x86_64 Kernel Module 185.18.14 Wed May 27 01:23:47 PDT 2009

ACPI: PCI Interrupt 0000:00:1b.0[A] -> GSI 16 (level, low) -> IRQ 169

PCI: Setting latency timer of device 0000:00:1b.0 to 64

floppy0: no floppy controllers found

Floppy drive(s): fd0 is 1.44M

floppy0: no floppy controllers found

lp0: using parport0 (interrupt-driven).

lp0: console ready

NET: Registered protocol family 10

lo: Disabled Privacy Extensions

IPv6 over IPv4 tunneling driver

irq 82: nobody cared (try booting with the "irqpoll" option)

Call Trace:

[] __report_bad_irq+0x30/0x7d

[] note_interrupt+0x1e6/0x227

[] __do_IRQ+0xbd/0x103

[] __do_softirq+0x89/0x133

[] do_IRQ+0xe7/0xf5

[] ret_from_intr+0x0/0xa

[] acpi_processor_idle+0x275/0x43a

[] acpi_processor_idle+0x26b/0x43a

[] notifier_call_chain+0x8/0x32

[] acpi_processor_idle+0x0/0x43a

[] cpu_idle+0x95/0xb8

[] start_kernel+0x220/0x225

[] _sinittext+0x22f/0x236

handlers:

[] (usb_hcd_irq+0x0/0x55)

Disabling IRQ #82

NVRM: RmInitAdapter failed! (0x12:0x2b:1697)

NVRM: rm_init_adapter(3) failed

ACPI: Power Button (FF) [PWRF]

ACPI: Power Button (CM) [VBTN]

md: Autodetecting RAID arrays.

md: autorun ...

md: ... autorun DONE.

device-mapper: multipath: version 1.0.5 loaded

EXT3 FS on dm-0, internal journal

kjournald starting. Commit interval 5 seconds

EXT3 FS on sda1, internal journal

EXT3-fs: mounted filesystem with ordered data mode.

Adding 6094840k swap on /dev/VolGroup00/LogVol01. Priority:-1 extents:1 across:6094840k

IA-32 Microcode Update Driver: v1.14a tigran@veritas.com

ip6_tables: (C) 2000-2006 Netfilter Core Team

ip_tables: (C) 2000-2006 Netfilter Core Team

Netfilter messages via NETLINK v0.30.

ip_conntrack version 2.4 (8192 buckets, 65536 max) - 304 bytes per conntrack

ADDRCONF(NETDEV_UP): eth0: link is not ready

Bluetooth: Core ver 2.10

NET: Registered protocol family 31

Bluetooth: HCI device and connection manager initialized

Bluetooth: HCI socket layer initialized

Bluetooth: L2CAP ver 2.8

Bluetooth: L2CAP socket layer initialized

Bluetooth: RFCOMM socket layer initialized

Bluetooth: RFCOMM TTY layer initialized

Bluetooth: RFCOMM ver 1.8

Bluetooth: HIDP (Human Interface Emulation) ver 1.1

NVRM: RmInitAdapter failed! (0x12:0x2b:1697)

NVRM: rm_init_adapter(3) failed

[/codebox]
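A long dmesg dump like the one above is easy to misread, so here is a minimal Python sketch that pulls out only the two symptoms that matter in it: the "nobody cared" IRQ message and the NVRM initialisation failures. The function name `scan_dmesg` and the embedded sample are illustrative assumptions, not part of any NVIDIA tool, and the patterns are simply copied from the messages quoted in this post.

```python
import re

# Patterns copied from the dmesg lines quoted in this thread.
IRQ_NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")
NVRM_FAILURE = re.compile(
    r"NVRM: (RmInitAdapter failed!.*|rm_init_adapter\(\d+\) failed)"
)


def scan_dmesg(text):
    """Return (irq_numbers, nvrm_failure_lines) found in a dmesg dump."""
    irqs = []
    failures = []
    for line in text.splitlines():
        match = IRQ_NOBODY_CARED.search(line)
        if match:
            irqs.append(int(match.group(1)))
        if NVRM_FAILURE.search(line):
            failures.append(line.strip())
    return irqs, failures


# A few lines taken from the output above, used here as sample input.
SAMPLE = """\
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 185.18.14 Wed May 27 01:23:47 PDT 2009
irq 82: nobody cared (try booting with the "irqpoll" option)
Disabling IRQ #82
NVRM: RmInitAdapter failed! (0x12:0x2b:1697)
NVRM: rm_init_adapter(3) failed
"""

irqs, failures = scan_dmesg(SAMPLE)
print("IRQs reported as 'nobody cared':", irqs)
print("NVRM failure lines:", len(failures))
```

To run it on a real machine, replace `SAMPLE` with the saved output of dmesg. An IRQ that the kernel disabled, followed by NVRM failures, is the combination seen in this thread.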

All CUDA code compiles fine, but when executed it gives the usual NVIDIA error
about "could not open the device file /dev/nvidiaX". :thumbsdown:

Could it be a hardware malfunction, or perhaps the OS did not
detect the PCIe cards properly? :wacko:

Any help is very welcome!
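On the question of whether the OS sees the PCIe cards: one rough way to check is to count the device classes in the lspci listing. The sketch below is a minimal Python example under stated assumptions: `count_device_classes` is a made-up helper name, and the sample holds only a subset of the lines from the listing in this post. The "Unknown device" and "Tesla S870" labels most likely only mean that the system's PCI ID database predates the hardware, so the count of 3D controllers is the more useful signal.

```python
import re

# One lspci line looks like:
#   87:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
LSPCI_LINE = re.compile(
    r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+([^:]+):\s+(.*)$"
)


def count_device_classes(lspci_text):
    """Count how many devices of each class appear in lspci output."""
    counts = {}
    for line in lspci_text.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if not match:
            continue  # skip blank or wrapped lines
        device_class = match.group(2)
        counts[device_class] = counts.get(device_class, 0) + 1
    return counts


# Subset of the listing from this post (most PCI bridge lines omitted).
SAMPLE = """\
03:00.0 VGA compatible controller: nVidia Corporation Unknown device 06fd (rev a1)
81:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
87:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
89:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
92:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
94:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
"""

for device_class, count in sorted(count_device_classes(SAMPLE).items()):
    print(device_class, count)
```

Four 3D controller entries plus one VGA controller would suggest that the bus enumeration itself is fine and that the problem lies later, at driver initialisation.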

Did you read the release notes?

http://developer.download.nvidia.com/compu…notes_linux.txt

Thank you very much for your reply. Yes, I did. In fact, the options in the following entry

[codebox]title Red Hat Desktop (2.6.9-42.ELsmp)

root (hd0,0)

uppermem 524288

kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf

initrd /initrd-2.6.9-42.ELsmp.img

[/codebox]

solved one of our initial problems: after installing the latest NVIDIA driver, the HAL service was dying
during boot. Adding those options to grub.conf made HAL stable again. :w00t:

The /dev/… entries are correctly created and have the right permissions according to multiple
posts in these forums. We are working as root. Disabling SELinux does not make any difference. :angry:
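For anyone who wants to repeat the device-node check, here is a small Python sketch that lists the nodes with their permission strings and says whether each one is a character device. The helper name `describe_nodes` and the default pattern `/dev/nvidia*` are assumptions made for illustration; adjust the pattern if your nodes are named differently.

```python
import glob
import os
import stat


def describe_nodes(pattern="/dev/nvidia*"):
    """Return (path, mode string, is_char_device) for each matching node."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        info = os.stat(path)
        rows.append(
            (path, stat.filemode(info.st_mode), stat.S_ISCHR(info.st_mode))
        )
    return rows


nodes = describe_nodes()
if not nodes:
    print("no matching device nodes found")
for path, mode, is_char in nodes:
    kind = "character device" if is_char else "NOT a character device"
    print(path, mode, kind)
```

In our case this kind of check came out clean, which is why the permissions were not the cause.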

After a lot of pain and suffering we have found a solution.

The key was here:

[codebox]irq 82: nobody cared (try booting with the "irqpoll" option)

Call Trace:

[] __report_bad_irq+0x30/0x7d

[] note_interrupt+0x1e6/0x227

[] __do_IRQ+0xbd/0x103

[] __do_softirq+0x89/0x133

[] do_IRQ+0xe7/0xf5

[] ret_from_intr+0x0/0xa

[] acpi_processor_idle+0x275/0x43a

[] acpi_processor_idle+0x26b/0x43a

[] notifier_call_chain+0x8/0x32

[] acpi_processor_idle+0x0/0x43a

[] cpu_idle+0x95/0xb8

[] start_kernel+0x220/0x225

[] _sinittext+0x22f/0x236

handlers:

[] (usb_hcd_irq+0x0/0x55)

Disabling IRQ #82

NVRM: RmInitAdapter failed! (0x12:0x2b:1697)

NVRM: rm_init_adapter(3) failed

[/codebox]

This string of errors did not show up when we removed the cards, so
we concluded it was the result of I/O APIC problems or conflicts with other
devices sharing the IRQ. :w00t: Adding the noapic kernel
parameter in grub.conf did the trick. :yes: Now the CUDA programs
run properly and give the expected results. :thumbup:
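For reference, the change amounts to appending noapic to the kernel line of the boot entry. The Python sketch below shows that edit on a string. It is only an illustration: `add_kernel_param` is a hypothetical helper, the sample entry is the one quoted earlier in this thread rather than our actual RHEL 5.3 entry, and the function touches every kernel line it finds, which may be more than you want. Writing the result back to the file (on Red Hat with legacy GRUB, normally /boot/grub/grub.conf) is left out on purpose; keep a backup first.

```python
def add_kernel_param(grub_conf_text, param="noapic"):
    """Append `param` to every 'kernel' line that does not already have it."""
    out = []
    for line in grub_conf_text.splitlines():
        words = line.split()
        if words and words[0] == "kernel" and param not in words[1:]:
            line = line.rstrip() + " " + param
        out.append(line)
    return "\n".join(out) + "\n"


# Sample entry copied from earlier in this thread (illustrative only).
SAMPLE_ENTRY = """\
title Red Hat Desktop (2.6.9-42.ELsmp)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img
"""

print(add_kernel_param(SAMPLE_ENTRY))
```

The function is idempotent, so running it twice does not add the parameter a second time. After rebooting with the new entry, the parameter should appear in /proc/cmdline.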