CUDA invalid device ordinal

I’ve installed CUDA and Caffe without errors. While running

make runtest

in the Caffe source directory, I keep getting

Check failed: error == cudaSuccess (10 vs. 0)  invalid device ordinal

error. After running ./deviceQuery in the CUDA directory, I got a very similar error:

cudaGetDeviceCount returned 10
-> invalid device ordinal
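
Error code 10 corresponds to cudaErrorInvalidDevice, the code behind the "invalid device ordinal" string, so the Caffe check and deviceQuery are most likely hitting the same underlying failure. Any minimal program that queries the runtime should surface it the same way; this is just a sketch (a hypothetical check_devices.cu, not deviceQuery itself), compiled with nvcc check_devices.cu -o check_devices:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // On this machine the expected output, matching deviceQuery above, is:
        //   cudaGetDeviceCount returned 10 -> invalid device ordinal
        printf("cudaGetDeviceCount returned %d -> %s\n",
               static_cast<int>(err), cudaGetErrorString(err));
        return 1;
    }
    // If the call succeeds, list the devices the runtime can actually use.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            printf("Device %d: %s (compute capability %d.%d)\n",
                   i, prop.name, prop.major, prop.minor);
        }
    }
    return 0;
}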

After even more googling, I ran nvidia-smi and got

Unable to determine the device handle for GPU 0000:04:00.0: Unable to communicate with GPU because it is insufficiently powered.
This may be because not all required external power cables are
attached, or the attached cables are not seated properly.

I contacted the sysadmin and he told me all cables are properly connected, but the problem remained. I then ran nvidia-debugdump -l and got

Found 2 NVIDIA devices
	Device ID:              0
	Device name:            NVS 315   (*PrimaryCard)
	GPU internal ID:        GPU-fc8b8a6f-c28f-9860-1469-453ea6a4abb0

Error: nvmlDeviceGetHandleByIndex(): Insufficient External Power
FAILED to get details on GPU (0x1): Insufficient External Power

and tried switching the driver with update-alternatives --config x86_64-linux-gnu_gl_conf:

Selection    Path                                       Priority   Status
------------------------------------------------------------
* 0            /usr/lib/nvidia-352/ld.so.conf              8604      auto mode
  1            /usr/lib/nvidia-352-prime/ld.so.conf        8603      manual mode
  2            /usr/lib/nvidia-352/ld.so.conf              8604      manual mode
  3            /usr/lib/x86_64-linux-gnu/mesa/ld.so.conf   500       manual mode

Press enter to keep the current choice[*], or type selection number: 1
update-alternatives: using /usr/lib/nvidia-352-prime/ld.so.conf to provide /etc/ld.so.conf.d/x86_64-linux-gnu_GL.conf (x86_64-linux-gnu_gl_conf) in manual mode

but deviceQuery keeps returning the same error.

I’d appreciate any suggestions on the matter.

I have never seen a false positive for the error “Unable to communicate with GPU because it is insufficiently powered”. The sensor built into the GPU will reliably diagnose insufficient power conditions.

The NVS 315 seems to be a very low-end GPU with compute capability 2.1, which is still supported by the currently shipping CUDA version 8. Judging from the pictures I found on the internet, it appears to have no auxiliary power connectors, so the following is general advice applicable to all GPUs.

The GPU draws power through the PCIe slot, up to 75W according to the PCIe specification. If it has auxiliary power connectors, it draws up to 75W through a strand with a 6-pin connector, and up to 150W through a strand with an 8-pin connector.

You would want to make sure that

(1) the power supply unit of the machine is sufficiently sized. Rule of thumb: the sum of the nominal wattage of all system components should be 50%-60% of the nominal wattage of the power supply; for example, components totaling 450W call for a PSU rated for roughly 750W to 900W. I would recommend the use of “80 PLUS Platinum” rated PSUs for reliability and efficiency. This website provides a useful overview of available 80 PLUS PSUs: https://plugloadsolutions.com/80PlusPowerSupplies.aspx

(2) the GPU is properly seated in the PCIe slot, AND mechanically secured at the bracket (screw, latch, etc). Connectors can be negatively impacted by vibrations and mechanical strain, so physically securing the GPU is important for reliable operation.

(3) auxiliary power supply cables and connectors are undamaged, and connectors are properly plugged in at the GPU. Normally there is a little tab on the connector that snaps into place, holding the power connector securely even in systems with lots of vibration (e.g. from spinning hardware like hard disks or fans).

(4) 8-pin PCIe power connectors are NOT driven, via a 6-pin to 8-pin converter, from a 6-pin PCIe power strand of the PSU.

(5) if there are multiple PCIe power connectors on the GPU, they are NOT driven by a single PCIe power strand from the PSU by means of a Y-splitter.

Well, apparently it did something! There were two cards on the server, an NVS 315 and a Tesla K40, which apparently caused some conflict. After removing the Tesla, nvidia-smi now produces

+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVS 315             Off  | 0000:03:00.0     N/A |                  N/A |
| 34%   50C    P0    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+

The error remains, though. Should I reinstall CUDA and Caffe?

All you did was remove the device that was incorrectly powered, so it’s not surprising that the insufficient-power error goes away when you do that.

It seems evident to me that your K40 was incorrectly powered.

Thanks, I’ve also done some googling, but do I need any special drivers for Tesla K40? Currently I have nvidia-352 installed on the server.

The driver that ships with CUDA 7.5 or CUDA 8 is compatible with the K40. Or you can download a driver from www.nvidia.com.

There is nothing wrong with an r352 driver for use with a K40.
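
If you want to double-check what your installation reports, the runtime API exposes both the highest CUDA version the installed driver supports and the version of the installed runtime. A small sketch (a hypothetical version_check.cu, not part of the CUDA samples), compiled with nvcc version_check.cu -o version_check:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver_ver = 0, runtime_ver = 0;
    // cudaDriverGetVersion reports the latest CUDA version the installed
    // driver supports; cudaRuntimeGetVersion reports the installed runtime.
    // Both encode versions as 1000*major + 10*minor (e.g. 7050 = CUDA 7.5).
    cudaDriverGetVersion(&driver_ver);
    cudaRuntimeGetVersion(&runtime_ver);
    printf("Driver supports up to CUDA %d.%d, runtime is CUDA %d.%d\n",
           driver_ver / 1000, (driver_ver % 100) / 10,
           runtime_ver / 1000, (runtime_ver % 100) / 10);
    return 0;
}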

Thank you for posting this; it helped me identify the issue with one of the cards.