CUDA 4 + driver 270.35 (C2050) random errors

[SOLVED see the last post]

Hi all,

I’m getting random errors with an old application that worked fine with CUDA 3.2 (using the old drivers). Some errors are simply incorrect results; others are reported as an unspecified random failure. After this happens, the entire server is left in an inconsistent state.

Right now, for example, this is what I get when running deviceQuery or nvidia-smi:

deviceQuery:

$ ./deviceQuery 

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.

FAILED

nvidia-smi:

# nvidia-smi -q

NVIDIA: could not open the device file /dev/nvidia0 (Input/output error).

Failed to initialize NVML: Unknown Error

# ls -al /dev/nvidia0

crw-rw-rw- 1 root root 195, 0 2011-03-31 10:20 /dev/nvidia0

Removing the driver with rmmod and running the two commands above again doesn’t solve the problem.

Information about my server:

$ cat /proc/version

Linux version 2.6.32-30-server (buildd@crested) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 22:46:09 UTC 2011

The same code, with CUDA 4.0 and the same driver, runs fine on another server equipped with a C1060.

Are you sure that the driver was properly loaded/updated on this machine?
What is the output of “cat /proc/driver/nvidia/version”?

$ cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module 270.35 Fri Mar 18 11:48:56 PDT 2011

GCC version: gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)

Also, when the server gets into that state, the only way to recover is to reboot it; a simple rmmod nvidia doesn’t work.
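
Before retrying after rmmod, it may be worth confirming that the module is really gone. A minimal sketch that looks for the module name in /proc/modules; the function name module_loaded and its optional file argument are made up for illustration and are not part of any NVIDIA tool:

```shell
# Succeed (exit status 0) if the named kernel module is listed in
# /proc/modules, fail otherwise.
# $1: module name; $2: file to read (defaults to /proc/modules).
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

# Example:
# if module_loaded nvidia; then echo "nvidia is still loaded"; fi
```

If the module is still listed after rmmod, the unload itself did not succeed, which is a different situation from a clean unload followed by a failing reload.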

I tried rolling back to an older driver version (270.27) and even to the old CUDA 3.2. These are the lines containing NVRM that I see in /var/log/messages:

Mar 31 11:07:57 essa-prototype kernel: [ 3320.440073] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 270.35 Fri Mar 18 11:48:56 PDT 2011

Mar 31 11:07:58 essa-prototype kernel: [ 3321.508289] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:07:58 essa-prototype kernel: [ 3321.508300] NVRM: rm_init_adapter(0) failed

Mar 31 11:09:49 essa-prototype kernel: [ 3432.101095] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:09:49 essa-prototype kernel: [ 3432.101106] NVRM: rm_init_adapter(0) failed

Mar 31 11:12:05 essa-prototype kernel: [ 3567.787543] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:12:05 essa-prototype kernel: [ 3567.787554] NVRM: rm_init_adapter(0) failed

Mar 31 11:12:24 essa-prototype kernel: [ 3586.905099] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:12:24 essa-prototype kernel: [ 3586.905111] NVRM: rm_init_adapter(0) failed

Mar 31 11:13:14 essa-prototype kernel: [ 3637.172132] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:13:14 essa-prototype kernel: [ 3637.172143] NVRM: rm_init_adapter(0) failed

Mar 31 11:13:28 essa-prototype kernel: [ 3651.130379] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:13:28 essa-prototype kernel: [ 3651.130390] NVRM: rm_init_adapter(0) failed

Mar 31 11:14:07 essa-prototype kernel: [ 3689.895181] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:14:07 essa-prototype kernel: [ 3689.895192] NVRM: rm_init_adapter(0) failed

Mar 31 11:14:11 essa-prototype kernel: [ 3693.907483] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:14:11 essa-prototype kernel: [ 3693.907494] NVRM: rm_init_adapter(0) failed

Mar 31 11:14:12 essa-prototype kernel: [ 3695.134598] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:14:12 essa-prototype kernel: [ 3695.134609] NVRM: rm_init_adapter(0) failed

Mar 31 11:14:15 essa-prototype kernel: [ 3697.891329] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:14:15 essa-prototype kernel: [ 3697.891340] NVRM: rm_init_adapter(0) failed

Mar 31 11:14:18 essa-prototype kernel: [ 3701.702352] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:14:18 essa-prototype kernel: [ 3701.702363] NVRM: rm_init_adapter(0) failed

Mar 31 11:15:04 essa-prototype kernel: [ 3747.196652] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:15:04 essa-prototype kernel: [ 3747.196663] NVRM: rm_init_adapter(0) failed

Mar 31 11:23:45 essa-prototype kernel: [ 4268.401478] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:23:45 essa-prototype kernel: [ 4268.401489] NVRM: rm_init_adapter(0) failed

Mar 31 11:25:01 essa-prototype kernel: [ 4343.956896] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:25:01 essa-prototype kernel: [ 4343.956907] NVRM: rm_init_adapter(0) failed

Mar 31 11:28:11 essa-prototype kernel: [ 4533.664381] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:28:11 essa-prototype kernel: [ 4533.664392] NVRM: rm_init_adapter(0) failed

Mar 31 11:28:58 essa-prototype kernel: [ 4581.133374] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:28:58 essa-prototype kernel: [ 4581.133385] NVRM: rm_init_adapter(0) failed

Mar 31 11:28:59 essa-prototype kernel: [ 4582.524573] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:28:59 essa-prototype kernel: [ 4582.524585] NVRM: rm_init_adapter(0) failed

Mar 31 11:29:00 essa-prototype kernel: [ 4583.424622] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:29:00 essa-prototype kernel: [ 4583.424632] NVRM: rm_init_adapter(0) failed

Mar 31 11:29:01 essa-prototype kernel: [ 4584.283817] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:29:01 essa-prototype kernel: [ 4584.283828] NVRM: rm_init_adapter(0) failed

Mar 31 11:29:02 essa-prototype kernel: [ 4585.082906] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:29:02 essa-prototype kernel: [ 4585.082917] NVRM: rm_init_adapter(0) failed

Mar 31 11:29:05 essa-prototype kernel: [ 4587.803114] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:29:05 essa-prototype kernel: [ 4587.803125] NVRM: rm_init_adapter(0) failed

Mar 31 11:29:20 essa-prototype kernel: [ 4603.005283] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:29:20 essa-prototype kernel: [ 4603.005293] NVRM: rm_init_adapter(0) failed

Mar 31 11:31:21 essa-prototype kernel: [ 4723.967072] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:31:21 essa-prototype kernel: [ 4723.967083] NVRM: rm_init_adapter(0) failed

Mar 31 11:31:24 essa-prototype kernel: [ 4726.847325] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:31:24 essa-prototype kernel: [ 4726.847336] NVRM: rm_init_adapter(0) failed

Mar 31 11:31:26 essa-prototype kernel: [ 4729.276171] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:31:26 essa-prototype kernel: [ 4729.276182] NVRM: rm_init_adapter(0) failed

Mar 31 11:32:22 essa-prototype kernel: [ 4784.764238] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 270.27 Fri Feb 18 17:36:20 PST 2011

Mar 31 11:33:04 essa-prototype kernel: [ 4826.575075] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 270.27 Fri Feb 18 17:36:20 PST 2011

Mar 31 11:33:05 essa-prototype kernel: [ 4827.671218] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 11:33:05 essa-prototype kernel: [ 4827.671229] NVRM: rm_init_adapter(0) failed

Mar 31 11:35:42 essa-prototype kernel: [ 11.993203] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 270.27 Fri Feb 18 17:36:20 PST 2011

Mar 31 12:30:04 essa-prototype kernel: [ 3266.969935] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 12:30:04 essa-prototype kernel: [ 3266.969946] NVRM: rm_init_adapter(0) failed

Mar 31 12:30:19 essa-prototype kernel: [ 3282.031983] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1052)

Mar 31 12:30:19 essa-prototype kernel: [ 3282.031994] NVRM: rm_init_adapter(0) failed

Mar 31 12:32:26 essa-prototype kernel: [ 3407.967121] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 260.19.21 Thu Nov 4 21:16:27 PDT 2010

Mar 31 12:52:09 essa-prototype kernel: [ 4588.526967] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 260.19.21 Thu Nov 4 21:16:27 PDT 2010

Mar 31 12:52:10 essa-prototype kernel: [ 4589.578627] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1046)

Mar 31 12:52:10 essa-prototype kernel: [ 4589.578638] NVRM: rm_init_adapter(0) failed

Mar 31 12:52:54 essa-prototype kernel: [ 4632.692633] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1046)

Mar 31 12:52:54 essa-prototype kernel: [ 4632.692645] NVRM: rm_init_adapter(0) failed

Mar 31 12:53:25 essa-prototype kernel: [ 4663.998538] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1046)

Mar 31 12:53:25 essa-prototype kernel: [ 4663.998549] NVRM: rm_init_adapter(0) failed

Mar 31 12:53:34 essa-prototype kernel: [ 4672.877601] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1046)

Mar 31 12:53:34 essa-prototype kernel: [ 4672.877611] NVRM: rm_init_adapter(0) failed

Mar 31 12:55:11 essa-prototype kernel: [ 12.025800] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 260.19.21 Thu Nov 4 21:16:27 PDT 2010

Mar 31 13:17:32 essa-prototype kernel: [ 1350.078983] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 13:17:36 essa-prototype kernel: [ 1354.068864] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 13:17:38 essa-prototype kernel: [ 1356.063798] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:17:23 essa-prototype kernel: [12113.839194] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 270.35 Fri Mar 18 11:48:56 PDT 2011

Mar 31 16:22:00 essa-prototype kernel: [12390.179489] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 270.35 Fri Mar 18 11:48:56 PDT 2011

Mar 31 16:23:56 essa-prototype kernel: [12505.340239] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:23:58 essa-prototype kernel: [12507.335173] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:00 essa-prototype kernel: [12509.330106] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:02 essa-prototype kernel: [12511.325040] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:04 essa-prototype kernel: [12513.319973] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:06 essa-prototype kernel: [12515.314907] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:08 essa-prototype kernel: [12517.309840] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:10 essa-prototype kernel: [12519.304773] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:12 essa-prototype kernel: [12521.299706] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:14 essa-prototype kernel: [12523.294640] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:16 essa-prototype kernel: [12525.289573] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:18 essa-prototype kernel: [12527.284506] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:20 essa-prototype kernel: [12529.279438] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:22 essa-prototype kernel: [12531.274372] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:24 essa-prototype kernel: [12533.273617] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:26 essa-prototype kernel: [12535.268623] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Mar 31 16:24:30 essa-prototype kernel: [12539.258541] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
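
With this many near-identical lines, it can help to collapse the log into one line per distinct NVRM message, with a count. A minimal sketch, assuming the syslog format shown above; the function name summarize_nvrm is made up for illustration:

```shell
# Collapse NVRM kernel messages read from stdin into "count message" lines,
# most frequent first. Everything up to and including "NVRM: " is dropped,
# so lines that differ only in their timestamps are counted together.
summarize_nvrm() {
    grep 'NVRM: ' \
        | sed 's/^.*NVRM: //' \
        | sort \
        | uniq -c \
        | sort -rn \
        | sed 's/^ *//'
}

# Example (the log path is the one mentioned in this thread):
# summarize_nvrm < /var/log/messages
```

Run on the excerpt above, this would reduce the output to a handful of lines: the driver load messages per version, the two RmInitAdapter error codes (1052 and 1046), the rm_init_adapter failures, and the os_schedule warning.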

Which motherboard/chipset? Is this a Magny-Cours system?
Could you open a bug?

No, it is not a Magny-Cours; it’s a dual-socket Supermicro board with two X5650 CPUs.

The system is not new: we have been using it for a year, and this behavior only started with CUDA 4.0.

$ lspci

00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22)

00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)

00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)

00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22)

00:13.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller (rev 22)

00:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Registers (rev 22)

00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

00:1a.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4

00:1a.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5

00:1a.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6

00:1a.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2

00:1b.0 Audio device: Intel Corporation 82801JI (ICH10 Family) HD Audio Controller

00:1c.0 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 1

00:1c.4 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 5

00:1c.5 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 6

00:1d.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1

00:1d.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2

00:1d.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3

00:1d.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1

00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)

00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller

00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller

00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller

02:00.0 VGA compatible controller: nVidia Corporation Device 06d1 (rev a3)

02:00.1 Audio device: nVidia Corporation Device 0be5 (rev a1)

03:00.0 VGA compatible controller: nVidia Corporation Device 06d1 (rev a3)

03:00.1 Audio device: nVidia Corporation Device 0be5 (rev a1)

05:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

07:01.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200eW WPCM450 (rev 0a)

80:00.0 PCI bridge: Intel Corporation 5500 Non-Legacy I/O Hub PCI Express Root Port 0 (rev 22)

80:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)

80:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)

80:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22)

80:13.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller (rev 22)

80:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Registers (rev 22)

80:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22)

80:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22)

80:14.3 PIC: Intel Corporation 5520/5500/X58 I/O Hub Throttle Registers (rev 22)

80:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

80:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

80:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

80:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

80:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

80:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

80:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

80:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)

83:00.0 VGA compatible controller: nVidia Corporation Device 06d1 (rev a3)

83:00.1 Audio device: nVidia Corporation Device 0be5 (rev a1)

84:00.0 VGA compatible controller: nVidia Corporation Device 06d1 (rev a3)

84:00.1 Audio device: nVidia Corporation Device 0be5 (rev a1)

Sure, I can open a bug. Shall I use the “bug report” on https://nvdeveloper.nvidia.com ?

Gaetano

Yes,
please use the registered developer site.

I am having the exact same problem with the NVIDIA S2050 on SLES 11 SP1. The difference in my case is that I am using 260.19.44 + CUDA 3.2. I posted about this 2 days ago; here is my post:

CUDA Driver Problems for Tesla S2050 with SLES 11 sp1

Submitted.

Last night all the regression tests passed, and there were no NVRM messages in the log.

Normally another process is also running on the same machine. Nobody uses it during the night, and when it is idle its only interaction with the GPU is that it keeps some memory allocated there.

Last night that process was not running. I will try running our regression tests during the day as well to see how it goes, and this afternoon, if all goes well, I will start that process to see whether the error then comes back.

Gaetano

PS: The compute mode is: Default

This weekend the system stopped working, even with the old driver and CUDA 3.2, and even after a cold reboot. By removing the installed C2050s one at a time, we were able to find the faulty one.

Electrically it seems to work, and it is recognized by the system:

$ lspci | grep -i nvidia
02:00.0 VGA compatible controller: nVidia Corporation Device 06d1 (rev a3)
02:00.1 Audio device: nVidia Corporation Device 0be5 (rev a1)

but neither our software nor the NVIDIA tools are able to communicate with it:

$ nvidia-smi -L -a
NVIDIA: could not open the device file /dev/nvidia0 (Input/output error).
Failed to attach gpu

You should have a look at the kernel ring buffer and see what errors the driver reports. That could be the basis of either a bug report or an RMA request. I am rather interested in those atomic interrupt errors from the driver. I have been getting those pretty regularly on an AMD 790FX board on which I have been testing the second 4.0 RC driver release.

In the kernel ring buffer, this is what I can find about NVRM:

Apr 4 11:26:47 essa-prototype kernel: [ 13.188846] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 260.19.21 Thu Nov 4 21:16:27 PDT 2010

Apr 4 11:28:14 essa-prototype kernel: [ 100.452314] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1046)

Apr 4 11:28:14 essa-prototype kernel: [ 100.452431] NVRM: rm_init_adapter(2) failed

Could the BIOS on the card have been destroyed somehow by using CUDA 4.0 and the 270.35 drivers?
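
The index in the rm_init_adapter(...) lines is 0 in the Mar 31 logs and 2 here. What the driver means by that index is not stated in this thread, so reading it as an adapter number is an assumption. Under that assumption, a small filter can show how the failures are distributed; the function name failures_per_adapter is made up for illustration:

```shell
# Count "rm_init_adapter(N) failed" messages per index N.
# Reads log text on stdin and prints "count index" lines, ordered by index.
failures_per_adapter() {
    sed -n 's/.*rm_init_adapter(\([0-9][0-9]*\)) failed.*/\1/p' \
        | sort -n \
        | uniq -c \
        | sed 's/^ *//'
}

# Example (the log path is the one mentioned in this thread):
# failures_per_adapter < /var/log/messages
```

If all failures cluster on one index across reboots and driver versions, that would be consistent with a single faulty card rather than a driver-wide problem.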

I hadn’t visited my code in a week, but I came back tonight and was having similar problems. I wondered what on earth I had done to corrupt my machine’s configuration so badly.

lshw showed that both GT 430 cards were attached, and not fried. The only thing I’d done within the last week is install unrelated software and do system upgrades through apt-get (btw, I’m developing on a headless Ubuntu Server 10.10).

I figured I would reinstall the drivers and troubleshoot from there. I found that the 4.0 RC2 SDK and 270.40 drivers were released tonight. I uninstalled all things CUDA related and installed the new drivers and SDK. Any change? Nope.

I tried executing nvidia-smi -L, but that actually errored out too. Running it with elevated sudo privileges did something, though. My guess is that there were no symlinks in /dev and nvidia-smi will attempt to create them if they don’t exist. Running deviceQuery now gave me the expected results. I’ve rebooted the machine a few times and have found that I need to execute sudo nvidia-smi -L to force those symlink creations.
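
One way to check this guess after a reboot is to list which of the expected files under /dev are absent before running anything as root. A minimal sketch, assuming the /dev/nvidiactl and /dev/nvidiaN naming seen elsewhere in this thread; the function name missing_nvidia_nodes and its arguments are made up for illustration:

```shell
# Print the NVIDIA device files that are expected but absent.
# $1: number of GPUs expected; $2: device directory (defaults to /dev).
# The body runs in a subshell, so its variables do not leak to the caller.
missing_nvidia_nodes() (
    count=$1
    devdir=${2:-/dev}
    [ -e "$devdir/nvidiactl" ] || echo "$devdir/nvidiactl"
    i=0
    while [ "$i" -lt "$count" ]; do
        [ -e "$devdir/nvidia$i" ] || echo "$devdir/nvidia$i"
        i=$((i + 1))
    done
    exit 0
)

# Example for a machine with two GPUs:
# missing_nvidia_nodes 2
```

An empty result while deviceQuery still fails would point away from missing files in /dev and toward a different cause.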

Now back to writing some more code and looking to see if the Visual Profiler still has that buffer overflow bug. Hopefully this was helpful to someone.

Here’s a summary:

11:55 PM: ./deviceQuery

[./deviceQuery] starting...

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.

[deviceQuery] test results...

FAILED

11:56 PM: sudo nvidia-smi -L

[sudo] password for kevin: 

GPU 0: GeForce GT 430 (UUID: N/A)

GPU 1: GeForce GT 430 (UUID: N/A)

11:56 PM: ./deviceQuery

[./deviceQuery] starting...

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 2 devices supporting CUDA

Device 0: "GeForce GT 430"

...

Hey I think you need a proper start up script, as suggested in Getting_Started_Linux.pdf:

#!/bin/bash

/sbin/modprobe nvidia

if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`

  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi
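
The counting step of that script can be tried on its own, without root and without touching /dev. The sketch below is a dry run: it counts NVIDIA controllers in lspci-style text and prints the mknod commands instead of executing them. The function names count_nvidia_gpus and print_mknod_cmds are made up for illustration; the major number 195 and the minor 255 for nvidiactl are taken from the script above.

```shell
# Count NVIDIA GPUs in lspci-style text read from stdin: lines that mention
# NVIDIA and are either a "3D controller" or a "VGA compatible controller".
count_nvidia_gpus() {
    grep -i nvidia | grep -c -e '3D controller' -e 'VGA compatible controller'
}

# Print, rather than run, the mknod commands for $1 GPUs.
# The body runs in a subshell, so its variables do not leak to the caller.
print_mknod_cmds() (
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "mknod -m 666 /dev/nvidia$i c 195 $i"
        i=$((i + 1))
    done
    echo "mknod -m 666 /dev/nvidiactl c 195 255"
)

# Example:
# print_mknod_cmds "$(lspci | count_nvidia_gpus)"
```

On the lspci listing posted earlier in this thread, the count would be 4: the four nVidia VGA controllers match, while the nVidia audio functions and the Matrox VGA controller do not.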