CUDA Error: invalid device ordinal, python: ./src/cuda.c:36: check_error: Assertion `0' failed

I’m trying to deploy a project on GCP with a Kubernetes cluster. I followed the steps to install the drivers on the 2x GPU node, and it worked. See the output I get inside the container on the node:

(venv) root@frameprocessor:/opt/visualcortex/bin# nvidia-smi
Fri Feb 15 05:09:36 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   30C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The program running inside the container (it uses the GPU via Darknet/YOLO and TensorFlow) threw the errors below:

root@frameprocessor:/opt/visualcortex# source ~/miniconda/bin/activate venv && python /opt/visualcortex/bin/
2019-02-15 06:11:40.692718: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-15 06:11:40.907127: I tensorflow/stream_executor/cuda/] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-15 06:11:40.908274: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties:
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:00:04.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2019-02-15 06:11:40.909382: I tensorflow/core/common_runtime/gpu/] Adding visible gpu devices: 0
2019-02-15 06:11:41.328257: I tensorflow/core/common_runtime/gpu/] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-15 06:11:41.328940: I tensorflow/core/common_runtime/gpu/] 0
2019-02-15 06:11:41.329272: I tensorflow/core/common_runtime/gpu/] 0: N
2019-02-15 06:11:41.329867: I tensorflow/core/common_runtime/gpu/] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7053 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:04.0, compute capability: 6.1)
CUDA Error: invalid device ordinal
python: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)
The drivers seem to be installed correctly, so why can't the program find them? Could you please help me figure out the issue?

Part of code:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
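Note that "invalid device ordinal" usually means the process asked for a GPU index that CUDA does not expose: the TensorFlow log above only ever lists device 0, while the code requests "0,1". A defensive sketch (my own, not part of the original program) would count the GPUs the container actually sees before setting the variable; it assumes nvidia-smi is on PATH, as it is in your container:

```python
import os
import subprocess

def visible_gpu_indices(requested, available_count):
    """Keep only the requested GPU indices that actually exist."""
    return [i for i in requested if 0 <= i < available_count]

def detect_gpu_count():
    """Count GPUs via `nvidia-smi -L`; returns 0 if the tool is unavailable."""
    try:
        out = subprocess.check_output(["nvidia-smi", "-L"], text=True)
    except (OSError, subprocess.CalledProcessError):
        return 0
    return sum(1 for line in out.splitlines() if line.startswith("GPU "))

if __name__ == "__main__":
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    usable = visible_gpu_indices([0, 1], detect_gpu_count())
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in usable)
```

If this yields only one usable index inside the container even though nvidia-smi on the node shows two GPUs, the pod spec is likely requesting only one GPU from the Kubernetes device plugin.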

Can you run deviceQuery from the cuda demos? If so, please post the output.

Sorry, I couldn't find deviceQuery in the container; maybe that's because the drivers were installed by GKE?

If I misunderstand anything, please tell me.

Thanks mate.

deviceQuery is part of the CUDA demo suite, which is normally installed alongside the CUDA toolkit, so it should reside inside the container. The path should be like

Thanks generis.

I still cannot find deviceQuery in the path:

root@frameprocessor:/# cd usr/local
root@frameprocessor:/usr/local# ls
bin cuda cuda-9.0 etc games include lib man nvidia sbin share src
root@frameprocessor:/usr/local# cd cuda/extras/
root@frameprocessor:/usr/local/cuda/extras# ls
CUPTI Debugger
root@frameprocessor:/usr/local/cuda/extras# cd ../../cuda-9.0/extras/
root@frameprocessor:/usr/local/cuda-9.0/extras# ls
CUPTI Debugger
root@frameprocessor:/usr/local/cuda-9.0/extras# cd CUPTI/sample/
root@frameprocessor:/usr/local/cuda-9.0/extras/CUPTI/sample# ls
activity_trace_async callback_event callback_metric callback_timestamp cupti_query event_multi_gpu event_sampling nvlink_bandwidth openacc_trace pc_sampling sass_source_map unified_memory
root@frameprocessor:/usr/local/cuda-9.0/extras/CUPTI/sample# cd ../../Debugger/
root@frameprocessor:/usr/local/cuda-9.0/extras/Debugger# ls
Readme.txt include lib64

The driver was installed by Kubernetes (GKE), not via the CUDA toolkit, which would explain why the demo suite is missing.
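Since deviceQuery isn't present in the image, a minimal stand-in can probe the driver API directly through ctypes. This is my own sketch, not an official tool; it assumes libcuda.so.1 is reachable on the loader path (the GKE driver installer normally mounts it into the container):

```python
import ctypes

def cuda_device_count():
    """Ask the CUDA driver API how many devices it sees (a rough
    stand-in for deviceQuery). Returns None if the driver library
    cannot be loaded or initialized."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # driver library not found on the loader path
    if libcuda.cuInit(0) != 0:  # CUDA_SUCCESS == 0
        return None
    count = ctypes.c_int(0)
    if libcuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value

if __name__ == "__main__":
    n = cuda_device_count()
    if n is None:
        print("CUDA driver not reachable from this process")
    else:
        print(f"{n} CUDA device(s) visible")
```

Comparing this count against what nvidia-smi reports on the node should show whether the container is actually being handed both GPUs.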