NVIDIA DevBox CUDA 8.0 install recognizes only 2 of 4 Titan X graphics cards

After upgrading to Ubuntu 16.04 and installing CUDA, none of the samples run. nvidia-smi shows only half of the GPUs in this machine, and deviceQuery returns an error as shown below. Please advise.

Here is the output from nvidia-smi

nvidia-smi -L
GPU 0: GeForce GTX TITAN X (UUID: GPU-26b3d94a-384c-e532-774e-4385fde429a0)
Unable to determine the device handle for gpu 0000:06:00.0: Unable to communicate with GPU because it is insufficiently powered.
This may be because not all required external power cables are
attached, or the attached cables are not seated properly.

Unable to determine the device handle for gpu 0000:09:00.0: Unable to communicate with GPU because it is insufficiently powered.
This may be because not all required external power cables are
attached, or the attached cables are not seated properly.

GPU 3: GeForce GTX TITAN X (UUID: GPU-c29f1e36-d438-f51a-b78d-5a47e928be72)

./deviceQuery
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 10
-> invalid device ordinal
Result = FAIL

The tool clearly states the issue that needs to be addressed.

(1) Make sure each GPU is firmly seated in its PCIe slot, and that brackets are secured with a screw or other relevant retaining device (e.g. locking bar).

(2) Make sure that all PCIe power connectors are hooked up for each GPU and firmly pushed in. The connectors usually have a small tab that audibly clicks into place when the power connector is pushed all the way into the socket on the GPU.

(3) Do not use Y-splitters or 6-pin to 8-pin converters on any PCIe power supply cables.

(4) Use a PSU (power supply unit) of sufficient wattage. The sum of the nominal wattage of all system components should not exceed 60% of the nominal wattage of the PSU; for example, a system whose components draw a nominal 900 W in total should use a PSU rated for at least 1500 W. 80 PLUS Platinum rated PSUs are recommended.

This is an NVIDIA devbox: https://developer.nvidia.com/devbox. It has never been opened. It was working with the pre-installed Ubuntu 14.04 and a very old version of CUDA. I clean-installed Ubuntu 16.04 and installed the CUDA libraries, but what nvidia-smi shows is wrong. Is there a specific setting CUDA needs for multi-GPU configurations such as this one?

The link you gave in #3 results in a “page not found”. I assume a devbox is a pre-configured machine NVIDIA made available to certain developers some while ago (since the link is dead), although if the machine contains a Titan X it can’t be that old.

In my experience, there are no false positives for that particular error message from nvidia-smi. That said, I cannot give a 100% guarantee that an improper software upgrade could not produce an incorrect error message, for example by installing the wrong driver package, or by invoking an outdated nvidia-smi executable against a new driver package.
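One quick way to check for the stale-nvidia-smi scenario is to compare the binary on PATH against the loaded kernel module. A minimal sketch, assuming a Linux host with the usual driver layout (on a machine without the driver it simply prints “not found” rather than failing):

```shell
#!/bin/sh
# Compare the nvidia-smi binary on PATH against the loaded kernel driver.
# If the two report different versions, a stale executable left over from a
# previous install is likely being picked up.

smi_path=$(command -v nvidia-smi || echo "not found")
echo "nvidia-smi on PATH: $smi_path"

# Version of the loaded kernel module (only present when the driver is loaded):
if [ -r /proc/driver/nvidia/version ]; then
    cat /proc/driver/nvidia/version
else
    echo "kernel module: not loaded"
fi

# Driver version as reported by the userspace tool:
nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null \
    || echo "userspace tool: not found or not working"
```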

Does NVIDIA provide guidance on how to go about upgrading the system software for these pre-configured machines? Does NVIDIA support user-upgrades to devboxes?

Was the physical machine transported recently? Vibrations during transport can have a negative impact on connectors of all kinds.

It is also possible (though very unlikely) that damage to the PSU occurred while power cycling the system, e.g. because the machine was not properly shut down before external power was removed.

If this were my machine, I would systematically exclude potential sources of error, starting with the hardware configuration, then proceeding to the software configuration: is the installed driver package sufficient for the installed CUDA version; are the driver and CUDA versions listed as supported for this version of Ubuntu; are the paths to executables and libraries correct and consistent (avoiding a mix of newly installed and previously installed software)?
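The path and library consistency checks in particular are easy to script. A rough sketch using the standard Linux install locations (the DevBox image may place things elsewhere):

```shell
#!/bin/sh
# Spot-check for a mix of old and new CUDA/driver installs.

# Toolkit version under the conventional install prefix:
if [ -r /usr/local/cuda/version.txt ]; then
    cuda_version=$(cat /usr/local/cuda/version.txt)
else
    cuda_version="not found at /usr/local/cuda"
fi
echo "CUDA toolkit: $cuda_version"

# Which driver library the dynamic linker would resolve; duplicate entries
# here often indicate leftovers from a previous install:
ldconfig -p 2>/dev/null | grep libcuda || echo "libcuda: not registered"

# Confirm PATH and LD_LIBRARY_PATH point at a single CUDA install:
echo "PATH=$PATH"
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
```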

[Later:]

It seems you may be referring to a “DIGITS devbox”? NVIDIA seems to have a bunch of software updates available for that; did you make use of those? [url]https://docs.nvidia.com/deeplearning/digits-devbox-sw-release-notes/index.html[/url]. If NVIDIA sells (or sold) these machines directly to end users, I would expect there to be a dedicated support channel for them. Have you checked for relevant information? These CUDA forums are more of a “users helping users” platform, not an official contact point for NVIDIA support. I wonder whether these boxes rely on a customized OS image, which would cause issues when a generic Ubuntu image is used.

The link had an extra period at the end.

DevBox is outdated by a generation already. And it’s overpriced by at least 100%, even at the time when the components were state of the art.

@JanetYellen So is there going to be a rate hike this month?

The DIGITS DevBox may be somewhat outdated or overpriced, but anybody who owns one (like the OP, apparently) would understandably prefer it to be in working condition. Again, if this is an expensive piece of equipment sold directly by NVIDIA, I would expect a dedicated support channel to exist.

Even though you say this was due to a software update, I’d still take njuffa’s advice and open the workstation’s side panel to check the cables anyway. It’s the easiest thing to do, so don’t be afraid to take off the panel, press each of the PCIe power plugs more firmly into each GPU, and reseat each GPU in its PCIe slot as well. While you’re in there you can use some compressed air to blow dust out of everything, especially heatsinks and fans. I mention this because you say you’ve never opened it (in years!), which likely means you’re long overdue for a dust-out.

Also, make sure the workstation is not plugged into a UPS. Years ago I had a terribly hard-to-diagnose problem where my code was unreproducibly unstable. I tracked it down to my UPS, which was not delivering nearly as much power as it claimed it could. My hardware and software were fine; I just had to plug directly into the wall.

@Njuffa: Sorry here is the correct link: [url]https://developer.nvidia.com/devbox[/url]

@JanetYellen: I opened the box and cleaned it up thoroughly using compressed air, and reinstalled all the cards and power cables. Same result:

nvidia-smi -L
GPU 0: GeForce GTX TITAN X (UUID: GPU-26b3d94a-384c-e532-774e-4385fde429a0)
Unable to determine the device handle for gpu 0000:06:00.0: Unable to communicate with GPU because it is insufficiently powered.
This may be because not all required external power cables are
attached, or the attached cables are not seated properly.

Unable to determine the device handle for gpu 0000:09:00.0: Unable to communicate with GPU because it is insufficiently powered.
This may be because not all required external power cables are
attached, or the attached cables are not seated properly.

GPU 3: GeForce GTX TITAN X (UUID: GPU-c29f1e36-d438-f51a-b78d-5a47e928be72)

So that’s a pretty good indication that loose connectors are not the problem, and you can now work through the software checklist I outlined above.

My approach would be to first restore the software to its previous (working) state, then perform controlled experiments that change only one variable at a time. Arguably, upgrading an OS is the exact opposite of that, as it changes hundreds of software components at once. It might make sense to investigate whether you can upgrade CUDA and the associated drivers one version at a time, while keeping the original OS in place.

What version of CUDA is currently installed on this machine? After looking up the DIGITS devbox, I would think it can’t be older than 6.5. You would want to skip version 7.0 and try CUDA 7.5; if that works, try CUDA 8.0.
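Stepping through versions is straightforward because CUDA toolkits install side by side under /usr/local. A sketch, assuming the default runfile/deb layout:

```shell
#!/bin/sh
# List the CUDA toolkits installed side by side, if any:
installed=$(ls -d /usr/local/cuda-* 2>/dev/null || echo "none found")
echo "Installed toolkits: $installed"

# The /usr/local/cuda symlink selects the active toolkit; repoint it to
# step from one version to the next (e.g. 7.5 first, then 8.0):
# sudo ln -sfn /usr/local/cuda-7.5 /usr/local/cuda
# sudo ln -sfn /usr/local/cuda-8.0 /usr/local/cuda
```

After switching, make sure PATH and LD_LIBRARY_PATH still point at /usr/local/cuda rather than at a version-specific directory.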

What specifically is the end goal here, that is, what features of newer software are required for your use case?

@Pourya Remove the side-panel and take a pic.