CUDA Error: invalid device ordinal, python: ./src/cuda.c:36: check_error: Assertion `0' failed

feng.xu · February 15, 2019, 10:12am

I’m trying to deploy a project on GCP with a Kubernetes cluster. I followed the steps in Running GPUs | Kubernetes Engine Documentation | Google Cloud to install the drivers on the 2xGPU node, and that worked. Here is the output I get inside the container on the node:

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |

The program (which uses the GPU via Darknet, YOLO and TensorFlow) running inside the container threw the following errors:

root@frameprocessor:/opt/visualcortex# source ~/miniconda/bin/activate venv && python /opt/visualcortex/bin/run_vision.py
2019-02-15 06:11:40.692718: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-15 06:11:40.907127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-15 06:11:40.908274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:00:04.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2019-02-15 06:11:40.909382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-15 06:11:41.328257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-15 06:11:41.328940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-15 06:11:41.329272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-15 06:11:41.329867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7053 MB memory) → physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:04.0, compute capability: 6.1)
CUDA Error: invalid device ordinal
python: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)

The drivers are installed correctly, so why can’t the program find them? Could you please help me figure out the issue?

Part of code:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
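
One thing that may be worth checking against the log above: TensorFlow reports only device 0, while this snippet requests devices 0 and 1. As I understand the CUDA documentation, when an index in CUDA_VISIBLE_DEVICES is invalid, only the devices listed before it stay visible, so a later request for GPU 1 could fail with "invalid device ordinal". Below is a minimal sketch of that rule. The names `honored_ordinals` and `physical_count` are hypothetical and used only for illustration, and the sketch handles integer indices only, not GPU UUIDs.

```python
def honored_ordinals(visible_devices: str, physical_count: int) -> list:
    """Approximate which entries of CUDA_VISIBLE_DEVICES the runtime would honor.

    Assumption: parsing stops at the first index that does not name an
    existing device, so only the entries before it remain visible.
    """
    honored = []
    for token in visible_devices.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_count:
            break  # first invalid entry hides itself and everything after it
        honored.append(int(token))
    return honored


if __name__ == "__main__":
    # Hypothetical case: "0,1" requested, but one GPU exposed to the container.
    print(honored_ordinals("0,1", 1))
```

If only one GPU is exposed to the container, just ordinal 0 would remain, so any code that then selects GPU index 1 has nothing to bind to.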

generix · February 15, 2019, 4:08pm

Can you run deviceQuery from the cuda demos? If so, please post the output.

feng.xu · February 16, 2019, 4:31am

Sorry, I couldn’t find deviceQuery in the container, maybe because the drivers were installed by GKE?

If I’ve misunderstood anything, please tell me.

Thanks mate.

generix · February 16, 2019, 3:17pm

deviceQuery is part of the CUDA demo suite, which is normally installed alongside the CUDA toolkit, so it should reside inside the container. The path should be something like
//extras/demo_suite
https://docs.nvidia.com/cuda/demo-suite/index.html
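
If the demo suite turns out not to be in the image, a rough stand-in for the device-count part of deviceQuery is to ask the driver library directly from Python. This is only a sketch. It assumes a Linux container where `libcuda.so.1` is on the loader path (on GKE that may require adding the driver's library directory to LD_LIBRARY_PATH), and the helper name `cuda_device_count` is made up for this example. `cuInit` and `cuDeviceGetCount` are CUDA driver API calls.

```python
import ctypes


def cuda_device_count():
    """Return the number of devices the CUDA driver reports, or None if unavailable."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # driver library not found on the loader path

    if libcuda.cuInit(0) != 0:  # 0 is CUDA_SUCCESS
        return None

    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value


if __name__ == "__main__":
    print("CUDA devices visible to this process:", cuda_device_count())
```

A result of 1 on a node that has two GPUs would suggest the container is being given only one of them. A result of None means the driver library could not be loaded or initialised.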

feng.xu · February 16, 2019, 11:41pm

Thanks generix.

I still cannot find deviceQuery in the path:

root@frameprocessor:/# cd usr/local
root@frameprocessor:/usr/local# ls
bin cuda cuda-9.0 etc games include lib man nvidia sbin share src
root@frameprocessor:/usr/local# cd cuda/extras/
root@frameprocessor:/usr/local/cuda/extras# ls
CUPTI Debugger
root@frameprocessor:/usr/local/cuda/extras# cd ../../cuda-9.0/extras/
root@frameprocessor:/usr/local/cuda-9.0/extras# ls
CUPTI Debugger
root@frameprocessor:/usr/local/cuda-9.0/extras# cd CUPTI/sample/
root@frameprocessor:/usr/local/cuda-9.0/extras/CUPTI/sample# ls
activity_trace_async callback_event callback_metric callback_timestamp cupti_query event_multi_gpu event_sampling nvlink_bandwidth openacc_trace pc_sampling sass_source_map unified_memory
root@frameprocessor:/usr/local/cuda-9.0/extras/CUPTI/sample# cd ../../Debugger/
root@frameprocessor:/usr/local/cuda-9.0/extras/Debugger# ls
Readme.txt include lib64

The driver was installed through Kubernetes (https://cloud.google.com/kubernetes-engine/docs/how-to/gpus) rather than through the CUDA toolkit.
