CUDA beta 1.0-9751 driver problem with dual cards
Can't get CUDA working on a 2 card box

I’ve had CUDA running successfully on our test machine for a few days, but today we put in a second GPU, and now it’s no longer working. The symptoms are that the CUDA apps print the following messages at startup:
NVIDIA: could not open the device file /dev/nvidia1 (Input/output error).

The device nodes are there, and X is running:
ls -al /dev/nv*
crw-rw-rw- 1 root root 195, 0 Feb 23 11:50 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Feb 23 11:50 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Feb 23 11:50 /dev/nvidiactl
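
Since the nodes exist and are world-readable and writable, the failure has to come from the open call itself. A rough way to probe that from the shell is sketched below. `can_open` is a hypothetical helper name, and the sketch assumes that a plain read-only open is enough to make the driver try to initialize the adapter, which may not match exactly what the CUDA runtime does.

```shell
# Try to open a file or device node read-only inside a subshell.
# Returns 0 if the open succeeds, non-zero otherwise; the error text is discarded.
# Assumption: opening /dev/nvidiaN this way exercises the same driver path
# that fails for the CUDA apps.
can_open() {
    ( exec 3<"$1" ) 2>/dev/null
}
```

For example, `for d in /dev/nvidia[0-9]*; do can_open "$d" && echo "$d ok" || echo "$d failed"; done` would report each card in turn.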

glxinfo etc. do identify the running X driver version as 1.0-9751, so it does appear to be installed correctly.

The system logs include some disturbing-looking NVRM messages, however:
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
allocation failed: out of vmalloc space - use vmalloc= to increase size.
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
allocation failed: out of vmalloc space - use vmalloc= to increase size.
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed

Suggestions?

Looks like you’ve exhausted vmalloc space:
allocation failed: out of vmalloc space - use vmalloc= to increase size.

See the NVIDIA driver README’s discussion of this issue.
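
To see how close the kernel is to that limit, the vmalloc counters in /proc/meminfo can be read directly. This is only a sketch: `vmalloc_summary` is a made-up helper name, and it assumes the kernel reports VmallocTotal and VmallocUsed in kB, as 2.6 kernels do.

```shell
# Print total and used vmalloc space in MB from meminfo-style text.
# Reads /proc/meminfo by default; another file can be passed as $1.
# Assumption: the VmallocTotal and VmallocUsed fields are present and in kB.
vmalloc_summary() {
    awk '
        /^VmallocTotal:/ { total = $2 }
        /^VmallocUsed:/  { used = $2 }
        END { printf "total=%dMB used=%dMB\n", total / 1024, used / 1024 }
    ' "${1:-/proc/meminfo}"
}
```

Running `vmalloc_summary` before and after loading the driver should show how much of the space each GPU consumes, and whether a larger vmalloc= value is needed.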

Thanks,
Lonni

I should mention that this is on an Intel quad core machine, with an ASUS P5N32-E SLI motherboard, 8GB of RAM, RHEL4u4 SMP kernel. I’m attaching the nvidia-bug-report.log file to this message.
nvidia_bug_report.log.gz (29.8 KB)

Lonni,
Where are the plain text README files hidden these days? Searching the HTML driver README is a fate worse than death when you’re looking for stuff like this.

john

Lonni,
I followed the suggestions in the driver README, adding “vmalloc=256MB” (which also required the grub “uppermem 524288” command to prevent grub from miscalculating), and the kernel boots and gets as far as the graphical boot progress screen and then the entire machine locks up hard.

John

After digging around some more, it looks like the next thing to try is booting the machine with some more kernel flags. “pci=nommconf” looks like a possibility, and I’ll give “idle=poll” a try as well if that doesn’t solve the hard lockup. I’ll report back if either one of these influences the results I’m getting…

John

:) :) :)

I got the system to boot and it now lets me run CUDA again. I have two GeForce 8800GTX cards installed. CUDA only “sees” one of the cards, but I’m no longer getting kernel errors during boot. I’ll do more testing to verify that SLI is working shortly. The boot options that ended up making the machine run were:
uppermem 524288 (GRUB option to work around a bug)
vmalloc=256MB pci=nommconf

I’ll see how stable this is and report back after I do more testing. This info may be remedial for people that are used to running 32-bit kernels, but this is the first time we’ve set up a 32-bit install for several years, so it has been interesting navigating around these problems.

Cheers,
John

Well, SLI doesn’t seem to want to work for some reason; the error given is simply:

“SLI is not yet supported in this configuration.”

Here is the pertinent part of the X startup log:

(**) NVIDIA(0): Depth 24, (--) framebuffer bpp 32
(==) NVIDIA(0): RGB weight 888
(==) NVIDIA(0): Default visual is TrueColor
(==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
(**) NVIDIA(0): Option "SLI" "on"
(**) NVIDIA(0): Enabling RENDER acceleration
(**) NVIDIA(0): NVIDIA SLI auto-select rendering option.
(WW) NVIDIA(0): DamageEvents are not currently compatible with SLI.  Disabling
(WW) NVIDIA(0):     DamageEvents.
(II) NVIDIA(0): NVIDIA SLI enabled.
(EE) NVIDIA(0): SLI is not yet supported in this configuration.  Only one GPU
(EE) NVIDIA(0):     will be used for this X screen.
(II) NVIDIA(0): NVIDIA GPU GeForce 8800 GTX at PCI:1:0:0 (GPU-0)
(--) NVIDIA(0): Memory: 786432 kBytes
(--) NVIDIA(0): VideoBIOS: 60.80.08.00.35
(II) NVIDIA(0): Detected PCI Express Link width: 16X
(--) NVIDIA(0): Interlaced video modes are supported on this GPU
(--) NVIDIA(0): Connected display device(s) on GeForce 8800 GTX at PCI:1:0:0:
(--) NVIDIA(0):     Samsung 170T Analog (CRT-1)
(--) NVIDIA(0): Samsung 170T Analog (CRT-1): 400.0 MHz maximum pixel clock
(II) NVIDIA(0): Assigned Display Device: CRT-1
(II) NVIDIA(0): Validated modes:
(II) NVIDIA(0):     "1280x1024"
(II) NVIDIA(0): Virtual screen size determined to be 1280 x 1024
(--) NVIDIA(0): DPI set to (95, 96); computed from "UseEdidDpi" X config
(--) NVIDIA(0):     option
(--) Depth 24 pixmap format is 32 bpp

Anything obvious I can do to correct that?

Disabling SLI makes both GPUs show up in CUDA, so it would appear that as far as CUDA goes, I’m up and running now. I’ll have to do more tests and see if I can get SLI working, and ideally I’ll see if I can squeeze another GPU in this machine (it has three PCI-E x16 slots…) so long as the power supply can hack it…

Cheers,
John

For anyone else using RHEL4u4 with dual GeForce 8800GTX cards on a similar motherboard, here are the default boot flags I’m sending to the kernel in my current /boot/grub/grub.conf:

title Red Hat Desktop (2.6.9-42.ELsmp) (Dual GPU SLI CUDA)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img
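
After rebooting, it may be worth confirming that the flags actually reached the kernel. A small sketch follows; `has_boot_flag` is a hypothetical helper name, and /proc/cmdline is assumed to hold the command line the running kernel was booted with.

```shell
# Succeed if a boot flag appears as a whole word on the kernel command line.
# $1: flag to look for, e.g. pci=nommconf
# $2: command line text; defaults to the contents of /proc/cmdline.
# Note: uppermem is a GRUB command, not a kernel flag, so it will not show up here.
has_boot_flag() {
    flag=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $flag "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Something like `has_boot_flag vmalloc=256MB && has_boot_flag pci=nommconf && echo "flags present"` would then check both entries from the stanza above.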

The machine is much happier now… :)

John

Following this same recipe I now have the same machine working with 3 8800GTX cards, all of them operating just fine with CUDA. I still haven’t figured out why SLI won’t work, but since this is intended for CUDA development I’ll leave that for another day.

John

SLI isn’t yet supported with G80-based cards in the graphics driver that you’re using. The next graphics driver release will support G80 SLI.

BTW, the plaintext graphics driver README is installed in /usr/share/doc/NVIDIA_GLX-1.0/README.txt

Thanks,
Lonni

SLI also won’t make any difference to your CUDA code. It doesn’t magically spread the workload across 2 or more GPUs. If you want to utilise CUDA for two GPUs, you’ll need to set up two CUDA contexts, and assign one to each card.

Kind regards,

James Milne

James,

Yes, the SLI question was actually related to rendering the results after the CUDA phase of the calculation is done, but as you say it’s an orthogonal issue to the CUDA runs themselves. I do have 3 GPUs working concurrently in CUDA presently with one thread for each, which is working marvelously well. I’m loving this test machine. I dare not say in public how much faster the GPU code is running than the CPU version on various large machines, at least not until I publish :-)

John

John, thanks for posting this solution. I now have my multi-GPU box up and running with this information.

Jim

I’m glad that others are finding it useful.

Cheers,

John