I’ve had CUDA running successfully on our test machine for a few days, but today we put in a second GPU, and now it’s no longer working. The symptoms are that the CUDA apps print the following messages at startup:
NVIDIA: could not open the device file /dev/nvidia1 (Input/output error).
The device nodes are there and X is up and running:
ls -al /dev/nv*
crw-rw-rw- 1 root root 195, 0 Feb 23 11:50 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Feb 23 11:50 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Feb 23 11:50 /dev/nvidiactl
glxinfo and the like identify the running X driver version as 1.0-9751, so the driver does appear to be installed correctly.
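For what it’s worth, a quick device-enumeration check along the lines of the SDK’s deviceQuery sample shows which GPUs the runtime can actually use; the sketch below is only an illustration (file and program names are made up), built with something like “nvcc -o probe probe.cu”:

/* Illustrative sketch: enumerate the devices the CUDA runtime can see,
 * similar to the SDK's deviceQuery sample. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Ask the runtime how many CUDA-capable devices it can see. */
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA reports %d device(s)\n", count);

    /* Print the name and memory size of each reported device. */
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
            printf("  device %d: %s, %lu bytes of global memory\n",
                   i, prop.name, (unsigned long)prop.totalGlobalMem);
    }
    return 0;
}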
The system logs include some disturbing-looking NVRM messages, however:
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
allocation failed: out of vmalloc space - use vmalloc= to increase size.
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
allocation failed: out of vmalloc space - use vmalloc= to increase size.
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
I should mention that this is on an Intel quad-core machine with an ASUS P5N32-E SLI motherboard, 8 GB of RAM, and the RHEL4u4 SMP kernel.
I’m attaching the nvidia-bug-report.log file to this message. nvidia_bug_report.log.gz (29.8 KB)
Lonni,
Where are the plain-text README files hidden these days? Searching the HTML driver README is a fate worse than death when you’re looking for stuff like this.
Lonni,
I followed the suggestions in the driver README and added “vmalloc=256MB” (which also required the GRUB “uppermem 524288” command to keep GRUB from miscalculating the memory size); the kernel now boots and gets as far as the graphical boot progress screen, and then the entire machine locks up hard.
After digging around some more, it looks like the next thing to try is booting the machine with some more kernel flags; “pci=nommconf” looks like a possibility,
and I’ll give “idle=poll” a try as well if that doesn’t solve the hard lockup. I’ll report back if either one of these influences the results I’m getting…
I got the system to boot and it now lets me run CUDA again. I have two GeForce 8800 GTX cards installed. CUDA only “sees” one of the cards, but I’m no longer getting kernel errors during boot. I’ll do more testing shortly to verify that SLI is working. The boot options that ended up making the machine run were:
uppermem 524288 (GRUB option to work around a bug)
vmalloc=256MB pci=nommconf
I’ll see how stable this is and report back after I do more testing. This info may be remedial for people who are used to running 32-bit kernels, but this is the first time we’ve set up a 32-bit install in several years, so it has been interesting navigating around these problems.
Well, SLI doesn’t seem to want to work for some reason; the error given is simply:
“SLI is not yet supported in this configuration.”
Here is the pertinent part of the X startup log:
(**) NVIDIA(0): Depth 24, (--) framebuffer bpp 32
(==) NVIDIA(0): RGB weight 888
(==) NVIDIA(0): Default visual is TrueColor
(==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
(**) NVIDIA(0): Option "SLI" "on"
(**) NVIDIA(0): Enabling RENDER acceleration
(**) NVIDIA(0): NVIDIA SLI auto-select rendering option.
(WW) NVIDIA(0): DamageEvents are not currently compatible with SLI. Disabling
(WW) NVIDIA(0): DamageEvents.
(II) NVIDIA(0): NVIDIA SLI enabled.
(EE) NVIDIA(0): SLI is not yet supported in this configuration. Only one GPU
(EE) NVIDIA(0): will be used for this X screen.
(II) NVIDIA(0): NVIDIA GPU GeForce 8800 GTX at PCI:1:0:0 (GPU-0)
(--) NVIDIA(0): Memory: 786432 kBytes
(--) NVIDIA(0): VideoBIOS: 60.80.08.00.35
(II) NVIDIA(0): Detected PCI Express Link width: 16X
(--) NVIDIA(0): Interlaced video modes are supported on this GPU
(--) NVIDIA(0): Connected display device(s) on GeForce 8800 GTX at PCI:1:0:0:
(--) NVIDIA(0): Samsung 170T Analog (CRT-1)
(--) NVIDIA(0): Samsung 170T Analog (CRT-1): 400.0 MHz maximum pixel clock
(II) NVIDIA(0): Assigned Display Device: CRT-1
(II) NVIDIA(0): Validated modes:
(II) NVIDIA(0): "1280x1024"
(II) NVIDIA(0): Virtual screen size determined to be 1280 x 1024
(--) NVIDIA(0): DPI set to (95, 96); computed from "UseEdidDpi" X config
(--) NVIDIA(0): option
(--) Depth 24 pixmap format is 32 bpp
Disabling SLI makes both GPUs show up in CUDA, so it would appear that as far as CUDA goes, I’m up and running now. I’ll have to do more tests to see if I can get SLI working, and ideally I’ll see if I can squeeze another GPU into this machine (it has three PCI-E x16 slots…), so long as the power supply can hack it…
For anyone else using RHEL4u4 with dual GeForce 8800 GTX cards on a similar motherboard, here are the default boot flags I’m passing to the kernel in my current /boot/grub/grub.conf:
title Red Hat Desktop (2.6.9-42.ELsmp) (Dual GPU SLI CUDA)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img
Following this same recipe, I now have the same machine working with three 8800 GTX cards, all of them operating just fine with CUDA. I still haven’t figured out why SLI won’t work, but since this machine is intended for CUDA development I’ll leave that for another day.
SLI also won’t make any difference to your CUDA code; it doesn’t magically spread the workload across two or more GPUs. If you want to utilise CUDA on two GPUs, you’ll need to set up two CUDA contexts and assign one to each card.
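To make that concrete, here’s a rough one-host-thread-per-GPU sketch (illustrative only; the kernel and sizes are made up, and on newer toolkits cudaThreadSynchronize() is spelled cudaDeviceSynchronize()). Each pthread calls cudaSetDevice() for its card before any other runtime call and then does its share of the work independently; build with something like “nvcc -o multigpu multigpu.cu -lpthread”:

/* Illustrative sketch: one host thread per GPU, each with its own context. */
#include <stdio.h>
#include <pthread.h>
#include <cuda_runtime.h>

/* Trivial kernel, purely for illustration. */
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

/* Each host thread owns one GPU: cudaSetDevice() must be the first
   CUDA runtime call this thread makes. */
static void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);

    const int n = 1 << 20;
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaThreadSynchronize();   /* cudaDeviceSynchronize() on newer toolkits */

    printf("device %d finished: %s\n", dev,
           cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_data);
    return NULL;
}

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count > 16)
        count = 16;

    pthread_t threads[16];
    int ids[16];
    for (int i = 0; i < count; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < count; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}

The important part is that each context lives in its own host thread; partitioning the data across the cards and combining the results is left to your host code.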
Yes, the SLI question was actually related to rendering the results after the CUDA phase of the calculation is done, but as you say that’s an orthogonal issue to the CUDA runs themselves. I do have three GPUs working concurrently in CUDA at present, with one host thread for each, and it’s working marvelously well. I’m loving this test machine; I dare not say in public how much faster the GPU code is running than the CPU version on various large machines, at least not until I publish :-)