I’ve had CUDA running successfully on our test machine for a few days, but today we put in a second GPU, and now it’s no longer working. The symptoms are that the CUDA apps print the following messages at startup:
NVIDIA: could not open the device file /dev/nvidia1 (Input/output error).
The device nodes are there and X is up and running:
ls -al /dev/nv*
crw-rw-rw- 1 root root 195, 0 Feb 23 11:50 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Feb 23 11:50 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Feb 23 11:50 /dev/nvidiactl
glxinfo and the like identify the running X driver version as 1.0-9751, so the driver does appear to be installed correctly.
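For what it’s worth, a quick device-enumeration check along the lines of the SDK’s deviceQuery sample shows which GPUs the runtime can actually use; the sketch below is only an illustration (file and program names are made up), built with something like “nvcc -o probe probe.cu”:

/* Illustrative sketch: enumerate the devices the CUDA runtime can see,
 * similar to the SDK's deviceQuery sample. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Ask the runtime how many CUDA-capable devices it can see. */
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA reports %d device(s)\n", count);

    /* Print the name and memory size of each reported device. */
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
            printf("  device %d: %s, %lu bytes of global memory\n",
                   i, prop.name, (unsigned long)prop.totalGlobalMem);
    }
    return 0;
}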
The system logs include some disturbing-looking NVRM messages, however:
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
allocation failed: out of vmalloc space - use vmalloc= to increase size.
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
allocation failed: out of vmalloc space - use vmalloc= to increase size.
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
NVRM: RmInitAdapter failed! (0x25:0xffffffff:1011)
NVRM: rm_init_adapter(1) failed
I should mention that this is on an Intel quad-core machine with an ASUS P5N32-E SLI motherboard, 8 GB of RAM, and the RHEL4u4 SMP kernel.
I’m attaching the nvidia-bug-report.log file to this message. nvidia_bug_report.log.gz (29.8 KB)
Lonni,
Where are the plain-text README files hidden these days? Searching the HTML driver README is a fate worse than death when you’re looking for stuff like this.
Lonni,
I followed the suggestions in the driver README and added “vmalloc=256MB” (which also required the GRUB “uppermem 524288” command to keep GRUB from miscalculating the memory size); the kernel now boots and gets as far as the graphical boot progress screen, and then the entire machine locks up hard.
After digging around some more, it looks like the next thing to try is booting the machine with some more kernel flags; “pci=nommconf” looks like a possibility,
and I’ll give “idle=poll” a try as well if that doesn’t solve the hard lockup. I’ll report back if either one of these influences the results I’m getting…
I got the system to boot and it now lets me run CUDA again. I have two GeForce 8800 GTX cards installed. CUDA only “sees” one of the cards, but I’m no longer getting kernel errors during boot. I’ll do more testing shortly to verify that SLI is working. The boot options that ended up making the machine run were:
uppermem 524288 (GRUB option to work around a bug)
vmalloc=256MB pci=nommconf
I’ll see how stable this is and report back after I do more testing. This info may be remedial for people who are used to running 32-bit kernels, but this is the first time we’ve set up a 32-bit install in several years, so it has been interesting navigating around these problems.
Well, SLI doesn’t seem to want to work for some reason; the error given is simply:
“SLI is not yet supported in this configuration.”
Here is the pertinent part of the X startup log:
(**) NVIDIA(0): Depth 24, (--) framebuffer bpp 32
(==) NVIDIA(0): RGB weight 888
(==) NVIDIA(0): Default visual is TrueColor
(==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
(**) NVIDIA(0): Option "SLI" "on"
(**) NVIDIA(0): Enabling RENDER acceleration
(**) NVIDIA(0): NVIDIA SLI auto-select rendering option.
(WW) NVIDIA(0): DamageEvents are not currently compatible with SLI. Disabling
(WW) NVIDIA(0): DamageEvents.
(II) NVIDIA(0): NVIDIA SLI enabled.
(EE) NVIDIA(0): SLI is not yet supported in this configuration. Only one GPU
(EE) NVIDIA(0): will be used for this X screen.
(II) NVIDIA(0): NVIDIA GPU GeForce 8800 GTX at PCI:1:0:0 (GPU-0)
(--) NVIDIA(0): Memory: 786432 kBytes
(--) NVIDIA(0): VideoBIOS: 60.80.08.00.35
(II) NVIDIA(0): Detected PCI Express Link width: 16X
(--) NVIDIA(0): Interlaced video modes are supported on this GPU
(--) NVIDIA(0): Connected display device(s) on GeForce 8800 GTX at PCI:1:0:0:
(--) NVIDIA(0): Samsung 170T Analog (CRT-1)
(--) NVIDIA(0): Samsung 170T Analog (CRT-1): 400.0 MHz maximum pixel clock
(II) NVIDIA(0): Assigned Display Device: CRT-1
(II) NVIDIA(0): Validated modes:
(II) NVIDIA(0): "1280x1024"
(II) NVIDIA(0): Virtual screen size determined to be 1280 x 1024
(--) NVIDIA(0): DPI set to (95, 96); computed from "UseEdidDpi" X config
(--) NVIDIA(0): option
(--) Depth 24 pixmap format is 32 bpp
Disabling SLI makes both GPUs show up in CUDA, so it would appear that as far as CUDA goes, I’m up and running now. I’ll have to do more tests to see if I can get SLI working, and ideally I’ll see if I can squeeze another GPU into this machine (it has three PCI-E x16 slots…), so long as the power supply can hack it…
For anyone else using RHEL4u4 with dual GeForce 8800 GTX cards on a similar motherboard, here are the default boot flags I’m passing to the kernel in my current /boot/grub/grub.conf:
title Red Hat Desktop (2.6.9-42.ELsmp) (Dual GPU SLI CUDA)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img
Following this same recipe, I now have the same machine working with three 8800 GTX cards, all of them operating just fine with CUDA. I still haven’t figured out why SLI won’t work, but since this machine is intended for CUDA development I’ll leave that for another day.
SLI also won’t make any difference to your CUDA code; it doesn’t magically spread the workload across two or more GPUs. If you want to utilise CUDA on two GPUs, you’ll need to set up two CUDA contexts and assign one to each card.
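To make that concrete, here’s a rough one-host-thread-per-GPU sketch (illustrative only; the kernel and sizes are made up, and on newer toolkits cudaThreadSynchronize() is spelled cudaDeviceSynchronize()). Each pthread calls cudaSetDevice() for its card before any other runtime call and then does its share of the work independently; build with something like “nvcc -o multigpu multigpu.cu -lpthread”:

/* Illustrative sketch: one host thread per GPU, each with its own context. */
#include <stdio.h>
#include <pthread.h>
#include <cuda_runtime.h>

/* Trivial kernel, purely for illustration. */
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

/* Each host thread owns one GPU: cudaSetDevice() must be the first
   CUDA runtime call this thread makes. */
static void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);

    const int n = 1 << 20;
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaThreadSynchronize();   /* cudaDeviceSynchronize() on newer toolkits */

    printf("device %d finished: %s\n", dev,
           cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_data);
    return NULL;
}

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count > 16)
        count = 16;

    pthread_t threads[16];
    int ids[16];
    for (int i = 0; i < count; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < count; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}

The important part is that each context lives in its own host thread; partitioning the data across the cards and combining the results is left to your host code.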
Yes, the SLI question was actually related to rendering the results after the CUDA phase of the calculation is done, but as you say that’s an orthogonal issue to the CUDA runs themselves. I do have three GPUs working concurrently in CUDA at present, with one host thread for each, and it’s working marvelously well. I’m loving this test machine; I dare not say in public how much faster the GPU code is running than the CPU version on various large machines, at least not until I publish :-)