Getting all GPUs to work

My system has 3 Tesla cards: one C2050 and two C1060s.

I’m on a fresh Ubuntu 9.10 install.

I installed the latest driver, NVIDIA-Linux-x86_64-256.35.run.

The X server only sees the C2050.

How can I get it to recognize the other 2 cards?

pluMmet –

I don’t have a specific answer to your question, unfortunately, but I wanted to add that I have been having exactly the same problem with the new Fermi GPUs, whether they are in Tesla cards or in video cards such as the GTX 480. Yours is the first post I’ve seen anywhere that identifies this problem: nVidia’s current Linux driver does not seem to work when a system has more than one GPU and one of the GPUs is a Fermi.

Today I verified this problem on the second of two brand new systems that each provide two PCI Express x16 slots. The systems are the Alienware Aurora, and I’ve tried every possible combination of video card and Tesla devices using four different flavors of CUDA-supported Linux:

Configurations causing GPU device detection failures involving the Fermi, using Fedora Core 10 and 12 and Ubuntu 9.10 and 10.04 –

  1. nvidia 9800 GTX in PCI slot 1 and Tesla c2050 in PCI slot 2 ==> X desktop works but Tesla not detected by driver.

  2. GTX 480 in PCI slot1 and c2050 Tesla in PCI slot 2 ==> X does not start; Xorg errors with missing screens, etc.

  3. Tesla c2050 in PCI slot 1 (functioning as the video card) and a GTX 480 in PCI slot 2 ==> X does not start; Xorg errors with missing screens, etc.

  4. Tesla c2050 in PCI slot 1 (functioning as the video card) and a nvidia 9800 GTX in PCI slot 2 ==> X desktop works and the driver sees the Tesla but doesn’t detect the 9800. However, CUDA does NOT work, giving various error messages about devices and files not found, etc.

  5. GTX 480 in PCI slot 1 and a nvidia 9800 GTX in PCI slot 2 ==> X desktop works and the driver sees the GTX 480 but doesn’t detect the 9800. Again, CUDA does NOT work, giving various error messages about devices and files not found, etc.

  6. nvidia 9800 GTX in PCI slot 1 and initially no second GPU device ==> X desktop works and CUDA works, but then if you shut down the system and stick in the GTX 480 or a Tesla, X starts but CUDA stops working. Pull out the GTX 480 or Tesla and CUDA goes back to working again. The same thing happens if you start off with just the GTX 480 or Tesla c2050 in the first slot and stick in the 9800 in the second slot; i.e. CUDA stops working. If a second Fermi device is put in the second slot then X won’t start.

The only case where we could use two GPUs was when both were from pre-Fermi generations (G92 and GT200):

  1. nvidia 9800 GTX in PCI slot 1 and Tesla c1060 in PCI slot 2 ==> X starts, both devices are detected by the driver, CUDA works.

All of these cases were conducted using the latest x86-64 256.53 Linux driver. Also, in the cases where X was able to start, the system command lspci did show BOTH nvidia devices were visible on the PCI bus.
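For anyone who wants to repeat the lspci check, here is a minimal sketch in Python that picks the NVIDIA entries out of lspci output. The sample text is made up for illustration (real device strings will differ), and the helper name `nvidia_devices` is hypothetical. On a real system the text could come from running `lspci` and capturing its output.

```python
import re

# Hypothetical lspci output, invented for illustration only.
SAMPLE_LSPCI = """\
00:1f.2 SATA controller: Example Corporation SATA AHCI Controller
03:00.0 VGA compatible controller: nVidia Corporation Device (example A)
04:00.0 3D controller: nVidia Corporation Device (example B)
"""

# Default lspci lines start with "bus:device.function" in hexadecimal.
LINE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(.*)$", re.IGNORECASE)


def nvidia_devices(lspci_text):
    """Return (address, description) pairs for lines that mention NVIDIA."""
    found = []
    for line in lspci_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and "nvidia" in match.group(2).lower():
            found.append((match.group(1), match.group(2)))
    return found


for address, description in nvidia_devices(SAMPLE_LSPCI):
    print(address, description)
```

If a card shows up here but not in the driver, the hardware is at least visible on the bus, which is the situation described above.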

Basically, the current driver seems to work with only ONE Fermi GPU, and only as long as it’s in the first PCI slot so it can be seen and used as the video card. Nvidia’s driver has a long history of looking in the first PCI slot for a compatible nvidia device. If it doesn’t find one there, it apparently doesn’t look in the second slot for an nvidia device and doesn’t load. This goes way back to the old problem where you try to use a non-nvidia or low-end, non-CUDA-compatible video card with a Tesla device, and CUDA runs only in emulation mode because the nvidia video driver never loads.

Sorry for this long-winded reply that doesn’t solve the problem. I’m hoping that someone will read this and either prove me wrong or say that they are having similar problems so this issue can be presented to nvidia for a solution. I have a long list of other attempts to fix this problem if anyone wants more details. The Fermi is an incredible chip and we would really like to put it in production as an ultra-high-end computing device.

Regards,

mas6700

Just to add a data point in the other direction: I have a system running Ubuntu 9.04 with the CUDA 3.1 driver and toolkit that has a GTX 470 and three GTX 295 cards in it. All 7 CUDA devices are visible in deviceQuery and work as expected. Your problem is probably more subtle than just the presence of Fermi and older cards.

No problems here. Running Ubuntu 10.04 on two very different motherboards: one an AMD with onboard 8200 video, one an i7 P6T board.
On the i7 machine, I have a GTX295, a GTX480, and a GT240 all seen and used simultaneously.

On the AMD machine, I have a GTX295, two GTX480s, and the onboard 8200, and all are seen.

Thank you mas6700!

I can see that there are others that have some sort of working systems but thanks to your post I can see that there are indeed problems that aren’t mine alone.

Using Ubuntu is an alternative that I was trying because I’m having the same troubles with XP64 and the program that I use can use either OS.

XP64 is faring better, however, as I can get one of the C1060s to appear sporadically.

I too have verified that all cards work. When I do get XP64 to recognize the first C1060, Device Manager shows my other C1060 as an unknown device. When neither C1060 is working, it shows them as HD audio devices. I spent 10 hours on this with XP64 trying various things. Then I switched over to Ubuntu, only to have neither C1060 recognized.

The part of the equation that we share, and the others in this thread do not, is that we are using the C2050.

You also have a problem with the 480 and 9800, a configuration not shared by our other reporters.

This, as far as I can tell, is a problem with our available drivers.

I’m guessing that most people in this section of the forum are programmers but I’m in this on the CG animation side of things. Image rendering programs are just now exploding with the ability to use CUDA to decrease render times. Artists are not as likely to be as savvy with all these problems. I hope nVidia delivers some proper drivers soon.

Any resolution of this issue?

I have an Alienware Area-51 running openSuSE 11.2 x86_64, dev driver 256.40, with a GeForce 480 in slot 1 and a C2050 in slot 2. Everything works fine without the Tesla, but once that is added I get errors “No devices detected” and “no screens found” and X will not start.

This seems identical to the issue reported above.

Thanks,
Matt

To follow up with what allowed X to work for me (I have yet to test CUDA):

Manually edit the xorg.conf file to explicitly identify the graphics card to use for X.

Thus in the Device section I added a BusID line (edit for your PCI configuration):

Section "Device"
  [...]
  Identifier "GF480"
  BusID   "PCI:9:0:0"
EndSection

And in the Screen section I made sure that the Device above was explicitly specified:

Section "Screen"
  [...]
  Device "GF480"
EndSection
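One thing that can trip people up when filling in the BusID: lspci prints bus, device and function numbers in hexadecimal, while the BusID string in xorg.conf is normally read as decimal. A rough sketch of the conversion in Python follows. The function name `lspci_to_busid` is hypothetical, and the sketch assumes a single PCI domain, so any leading domain prefix is simply ignored.

```python
def lspci_to_busid(address):
    """Convert an lspci address such as '0a:00.0' to an xorg.conf BusID string.

    Assumption: lspci values are hexadecimal and BusID fields are decimal.
    A leading PCI domain (as in '0000:0a:00.0') is ignored in this sketch.
    """
    bus_dev, function = address.strip().split(".")
    parts = bus_dev.split(":")
    bus, device = parts[-2], parts[-1]
    return "PCI:%d:%d:%d" % (int(bus, 16), int(device, 16), int(function, 16))


print(lspci_to_busid("09:00.0"))  # PCI:9:0:0
print(lspci_to_busid("0a:00.0"))  # PCI:10:0:0
```

For bus numbers below 0a the two notations look identical, which is why the mismatch often goes unnoticed until a card sits on a higher-numbered bus.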

Pretty basic X configuration stuff, but I only fiddle with X every two or three years. I’m a bit surprised that nvidia-xconfig doesn’t handle this.

  • Matt

As a final follow-up, this is a known bug in X that can be worked around as I stated, or by specifying --enable-all-gpus to nvidia-xconfig. This is documented in the driver README file available when you download the production drivers:

http://us.download.nvidia.com/XFree86/Linu…nownissues.html

  • Matt

Your X server recognises only the C2050 because that is the only card with a graphics output. The C1060s are purely CUDA processors.

The X server has nothing to do with running CUDA programs; it only handles displaying on the screen(s), so it will use the only available device.

(Actually it is more of a problem if you have several cards with graphics outputs - telling X which to use and CUDA which to use).

What happens when you run deviceQuery? If it lists all three devices, then all are available for CUDA processing.
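If the deviceQuery sample isn't built yet, a similar check can be sketched in Python by calling the CUDA driver API through ctypes. This is only an illustration, not a replacement for deviceQuery: it assumes a Linux system where the driver library can be found as libcuda, and the function name `list_cuda_devices` is hypothetical. The driver calls used are cuInit, cuDeviceGetCount, cuDeviceGet and cuDeviceGetName.

```python
import ctypes
import ctypes.util


def list_cuda_devices():
    """Return CUDA device names, or None if the driver library cannot be loaded."""
    path = ctypes.util.find_library("cuda")  # assumption: Linux, libcuda installed
    if path is None:
        return None
    try:
        cuda = ctypes.CDLL(path)
    except OSError:
        return None

    # A non-zero result means the driver is present but not usable;
    # this sketch treats that as "no devices" rather than decoding the error.
    if cuda.cuInit(0) != 0:
        return []
    count = ctypes.c_int(0)
    if cuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return []

    names = []
    for ordinal in range(count.value):
        device = ctypes.c_int(0)
        if cuda.cuDeviceGet(ctypes.byref(device), ordinal) != 0:
            continue
        name = ctypes.create_string_buffer(256)
        if cuda.cuDeviceGetName(name, len(name), device) == 0:
            names.append(name.value.decode("utf-8", "replace"))
    return names


devices = list_cuda_devices()
if devices is None:
    print("CUDA driver library not found")
else:
    print("%d CUDA device(s): %s" % (len(devices), ", ".join(devices)))
```

If the count here matches the number of cards in the machine, the X server seeing only one of them is not a problem for CUDA work.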
