Hardware:
Dell Precision T5500, Intel-based, NVIDIA Quadro NVS 295 + Tesla S1070
Software:
Red Hat Enterprise Linux 5.3
gcc 4.1
g++ 4.1
latest driver from NVIDIA CUDA Zone (downloaded last Wednesday)
latest toolkit (same source)
latest SDK (same source)
The driver installed without errors when we followed the instructions in the
brochure included in the S1070 box. The driver that shipped in the box, by
contrast, did not work: the system crashed. We can start the driver, and it
identifies the actual graphics card (Quadro) and the Tesla HPC unit with the
right labels. However, the lspci | grep nVidia command reports unexpected
device names:
[codebox]03:00.0 VGA compatible controller: nVidia Corporation Unknown device 06fd (rev a1)
81:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
82:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
82:01.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
82:02.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
82:03.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
85:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
86:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
86:01.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
86:02.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
86:03.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
87:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
89:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
8c:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
8d:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
8d:01.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
8d:02.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
8d:03.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
90:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
91:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
91:01.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
91:02.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
91:03.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
92:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
94:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
[/codebox]
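For anyone who wants to tally the listing quickly, here is a minimal Python sketch that parses lspci-style lines and counts entries per device class. The sample text is a shortened copy of the output above with each entry on one line; the function names `parse_lspci` and `tally_by_class` are made up for illustration and are not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Shortened sample in the same format as the lspci output above
# (one entry per line; only one of the PCI bridge lines is kept).
SAMPLE = """\
03:00.0 VGA compatible controller: nVidia Corporation Unknown device 06fd (rev a1)
81:00.0 PCI bridge: nVidia Corporation Tesla S870 (rev a3)
87:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
89:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
92:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
94:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
"""

# slot, device class, description, optional "(rev xx)" suffix
LINE_RE = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+"
    r"(?P<cls>[^:]+):\s+"
    r"(?P<desc>.*?)\s*"
    r"(?:\(rev (?P<rev>[0-9a-f]+)\))?$"
)


def parse_lspci(text):
    """Return a list of (slot, device class, description) tuples."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append(
                (match.group("slot"), match.group("cls"), match.group("desc"))
            )
    return entries


def tally_by_class(entries):
    """Count how many entries fall into each device class."""
    return Counter(cls for _slot, cls, _desc in entries)


if __name__ == "__main__":
    for cls, count in tally_by_class(parse_lspci(SAMPLE)).items():
        print(f"{cls}: {count}")
```

On the full listing, this kind of tally gives four "3D controller" entries, which at least matches the number of GPUs in an S1070, so the cards do appear on the PCI bus even though the names are reported as unknown.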
and dmesg gives:
[codebox]nvidia: module license 'NVIDIA' taints kernel.
GSI 23 sharing vector 0x6A and IRQ 23
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 24 (level, low) -> IRQ 106
PCI: Setting latency timer of device 0000:03:00.0 to 64
GSI 24 sharing vector 0x72 and IRQ 24
ACPI: PCI Interrupt 0000:87:00.0[A] -> GSI 63 (level, low) -> IRQ 114
PCI: Setting latency timer of device 0000:87:00.0 to 64
GSI 25 sharing vector 0x7A and IRQ 25
ACPI: PCI Interrupt 0000:89:00.0[A] -> GSI 54 (level, low) -> IRQ 122
PCI: Setting latency timer of device 0000:89:00.0 to 64
GSI 26 sharing vector 0x82 and IRQ 26
ACPI: PCI Interrupt 0000:92:00.0[A] -> GSI 64 (level, low) -> IRQ 130
PCI: Setting latency timer of device 0000:92:00.0 to 64
GSI 27 sharing vector 0x8A and IRQ 27
ACPI: PCI Interrupt 0000:94:00.0[A] -> GSI 56 (level, low) -> IRQ 138
PCI: Setting latency timer of device 0000:94:00.0 to 64
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 185.18.14 Wed May 27 01:23:47 PDT 2009
ACPI: PCI Interrupt 0000:00:1b.0[A] -> GSI 16 (level, low) -> IRQ 169
PCI: Setting latency timer of device 0000:00:1b.0 to 64
floppy0: no floppy controllers found
Floppy drive(s): fd0 is 1.44M
floppy0: no floppy controllers found
lp0: using parport0 (interrupt-driven).
lp0: console ready
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
IPv6 over IPv4 tunneling driver
irq 82: nobody cared (try booting with the "irqpoll" option)
Call Trace:
[] __report_bad_irq+0x30/0x7d
[] note_interrupt+0x1e6/0x227
[] __do_IRQ+0xbd/0x103
[] __do_softirq+0x89/0x133
[] do_IRQ+0xe7/0xf5
[] ret_from_intr+0x0/0xa
[] acpi_processor_idle+0x275/0x43a
[] acpi_processor_idle+0x26b/0x43a
[] notifier_call_chain+0x8/0x32
[] acpi_processor_idle+0x0/0x43a
[] cpu_idle+0x95/0xb8
[] start_kernel+0x220/0x225
[] _sinittext+0x22f/0x236
handlers:
[] (usb_hcd_irq+0x0/0x55)
Disabling IRQ #82
NVRM: RmInitAdapter failed! (0x12:0x2b:1697)
NVRM: rm_init_adapter(3) failed
ACPI: Power Button (FF) [PWRF]
ACPI: Power Button (CM) [VBTN]
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
device-mapper: multipath: version 1.0.5 loaded
EXT3 FS on dm-0, internal journal
kjournald starting. Commit interval 5 seconds
EXT3 FS on sda1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 6094840k swap on /dev/VolGroup00/LogVol01. Priority:-1 extents:1 across:6094840k
IA-32 Microcode Update Driver: v1.14a tigran@veritas.com
ip6_tables: (C) 2000-2006 Netfilter Core Team
ip_tables: (C) 2000-2006 Netfilter Core Team
Netfilter messages via NETLINK v0.30.
ip_conntrack version 2.4 (8192 buckets, 65536 max) - 304 bytes per conntrack
ADDRCONF(NETDEV_UP): eth0: link is not ready
Bluetooth: Core ver 2.10
NET: Registered protocol family 31
Bluetooth: HCI device and connection manager initialized
Bluetooth: HCI socket layer initialized
Bluetooth: L2CAP ver 2.8
Bluetooth: L2CAP socket layer initialized
Bluetooth: RFCOMM socket layer initialized
Bluetooth: RFCOMM TTY layer initialized
Bluetooth: RFCOMM ver 1.8
Bluetooth: HIDP (Human Interface Emulation) ver 1.1
NVRM: RmInitAdapter failed! (0x12:0x2b:1697)
NVRM: rm_init_adapter(3) failed
[/codebox]
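To pull the relevant lines out of a long dmesg dump like this one, a small Python sketch along the following lines may help. It only looks for the two message patterns that appear in the log above ("rm_init_adapter(N) failed" and "irq N: nobody cared"). The function name `summarize` is hypothetical, and reading the number in parentheses as an adapter index is our assumption, not something the log states.

```python
import re

# Shortened sample taken from the dmesg output above.
SAMPLE_DMESG = """\
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 185.18.14 Wed May 27 01:23:47 PDT 2009
irq 82: nobody cared (try booting with the "irqpoll" option)
Disabling IRQ #82
NVRM: RmInitAdapter failed! (0x12:0x2b:1697)
NVRM: rm_init_adapter(3) failed
"""

# Assumption: the number in parentheses identifies the adapter that failed.
FAIL_RE = re.compile(r"^NVRM: rm_init_adapter\((\d+)\) failed")
IRQ_RE = re.compile(r"^irq (\d+): nobody cared")


def summarize(text):
    """Collect adapter init failures and unhandled IRQ numbers from dmesg text."""
    failed_adapters = []
    unhandled_irqs = []
    for line in text.splitlines():
        line = line.strip()
        fail = FAIL_RE.match(line)
        if fail:
            failed_adapters.append(int(fail.group(1)))
        irq = IRQ_RE.match(line)
        if irq:
            unhandled_irqs.append(int(irq.group(1)))
    return {"failed_adapters": failed_adapters, "unhandled_irqs": unhandled_irqs}


if __name__ == "__main__":
    print(summarize(SAMPLE_DMESG))
```

In the full log above, the same failure for adapter 3 shows up twice, and the only handler listed for the unhandled IRQ 82 is usb_hcd_irq.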
Any CUDA code compiles fine, but when executed it gives the usual NVIDIA
error: "could not open the device file /dev/nvidiaX".
Could it be a hardware malfunction, or did the OS perhaps fail to detect
the PCIe cards properly?
Any help is very welcome!