CUDA invalid device ordinal

I’ve installed CUDA and Caffe without errors. While running

make runtest

in the Caffe source directory, I keep getting

Check failed: error == cudaSuccess (10 vs. 0)  invalid device ordinal

error. After running ./deviceQuery in the CUDA directory, I got a very similar error:

cudaGetDeviceCount returned 10
-> invalid device ordinal
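
Error code 10 corresponds to cudaErrorInvalidDevice, the code behind the "invalid device ordinal" string, so the Caffe check and deviceQuery are most likely hitting the same underlying failure. Any minimal program that queries the runtime should surface it the same way; this is just a sketch (a hypothetical check_devices.cu, not deviceQuery itself), compiled with nvcc check_devices.cu -o check_devices:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // On this machine the expected output, matching deviceQuery above, is:
        //   cudaGetDeviceCount returned 10 -> invalid device ordinal
        printf("cudaGetDeviceCount returned %d -> %s\n",
               static_cast<int>(err), cudaGetErrorString(err));
        return 1;
    }
    // If the call succeeds, list the devices the runtime can actually use.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            printf("Device %d: %s (compute capability %d.%d)\n",
                   i, prop.name, prop.major, prop.minor);
        }
    }
    return 0;
}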

After even more googling, I ran nvidia-smi and got

Unable to determine the device handle for GPU 0000:04:00.0: Unable to communicate with GPU because it is insufficiently powered.
This may be because not all required external power cables are
attached, or the attached cables are not seated properly.

I contacted the sysadmin and he told me all cables are properly connected, but the problem remained. I then ran nvidia-debugdump -l and got

Found 2 NVIDIA devices
	Device ID:              0
	Device name:            NVS 315   (*PrimaryCard)
	GPU internal ID:        GPU-fc8b8a6f-c28f-9860-1469-453ea6a4abb0

Error: nvmlDeviceGetHandleByIndex(): Insufficient External Power
FAILED to get details on GPU (0x1): Insufficient External Power

and tried switching the driver with update-alternatives --config x86_64-linux-gnu_gl_conf:

Selection    Path                                       Priority   Status
------------------------------------------------------------
* 0            /usr/lib/nvidia-352/ld.so.conf              8604      auto mode
  1            /usr/lib/nvidia-352-prime/ld.so.conf        8603      manual mode
  2            /usr/lib/nvidia-352/ld.so.conf              8604      manual mode
  3            /usr/lib/x86_64-linux-gnu/mesa/ld.so.conf   500       manual mode

Press enter to keep the current choice[*], or type selection number: 1
update-alternatives: using /usr/lib/nvidia-352-prime/ld.so.conf to provide /etc/ld.so.conf.d/x86_64-linux-gnu_GL.conf (x86_64-linux-gnu_gl_conf) in manual mode

but deviceQuery keeps returning the same error.

I’d appreciate any suggestions on the matter.

I have never seen a false positive for the error “Unable to communicate with GPU because it is insufficiently powered”. The sensor built into the GPU will reliably diagnose insufficient power conditions.

The NVS 315 seems to be a very low-end GPU with compute capability 2.1, which is still supported by the currently shipping CUDA version 8. Judging from the pictures I found on the internet, it appears to have no auxiliary power connectors, so the following is general advice applicable to all GPUs.

The GPU draws power through the PCIe slot, up to 75W according to the PCIe specification. If it has auxiliary power connectors, it draws up to 75W through a strand with a 6-pin connector, and up to 150W through a strand with an 8-pin connector.

You would want to make sure that

(1) the power supply unit of the machine is sufficiently sized. Rule of thumb: the sum of the nominal wattage of all system components should be 50%-60% of the nominal wattage of the power supply; for example, components totaling 450W call for a PSU rated for roughly 750W to 900W. I would recommend the use of “80 PLUS Platinum” rated PSUs for reliability and efficiency. This website provides a useful overview of available 80 PLUS PSUs: https://plugloadsolutions.com/80PlusPowerSupplies.aspx

(2) the GPU is properly seated in the PCIe slot, AND mechanically secured at the bracket (screw, latch, etc). Connectors can be negatively impacted by vibrations and mechanical strain, so physically securing the GPU is important for reliable operation.

(3) auxiliary power supply cables and connectors are undamaged, and connectors are properly plugged in at the GPU. Normally there is a little tab on the connector that snaps into place, holding the power connector securely even in systems with lots of vibration (e.g. from spinning hardware like hard disks or fans).

(4) 8-pin PCIe power connectors are NOT driven, via a 6-pin to 8-pin converter, from a 6-pin PCIe power strand of the PSU.

(5) if there are multiple PCIe power connectors on the GPU, they are NOT driven by a single PCIe power strand from the PSU by means of a Y-splitter.

Well, apparently it did something! There were two cards on the server, an NVS 315 and a Tesla K40, which apparently caused some conflict. After removing the Tesla, nvidia-smi now produces

+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVS 315             Off  | 0000:03:00.0     N/A |                  N/A |
| 34%   50C    P0    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+

The error remains, though. Should I reinstall CUDA and Caffe?

All you did was remove the device that was incorrectly powered, so it’s not surprising that the insufficient-power error goes away when you do that.

It seems evident to me that your K40 was incorrectly powered.

Thanks, I’ve also done some googling, but do I need any special drivers for Tesla K40? Currently I have nvidia-352 installed on the server.

The driver that ships with CUDA 7.5 or CUDA 8 is compatible with the K40. Or you can download a driver from www.nvidia.com.

There is nothing wrong with an r352 driver for use with a K40.
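
If you want to double-check what your installation reports, the runtime API exposes both the highest CUDA version the installed driver supports and the version of the installed runtime. A small sketch (a hypothetical version_check.cu, not part of the CUDA samples), compiled with nvcc version_check.cu -o version_check:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver_ver = 0, runtime_ver = 0;
    // cudaDriverGetVersion reports the latest CUDA version the installed
    // driver supports; cudaRuntimeGetVersion reports the installed runtime.
    // Both encode versions as 1000*major + 10*minor (e.g. 7050 = CUDA 7.5).
    cudaDriverGetVersion(&driver_ver);
    cudaRuntimeGetVersion(&runtime_ver);
    printf("Driver supports up to CUDA %d.%d, runtime is CUDA %d.%d\n",
           driver_ver / 1000, (driver_ver % 100) / 10,
           runtime_ver / 1000, (runtime_ver % 100) / 10);
    return 0;
}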

Thank you for posting this; it helped me identify the issue with one of the cards.