Unable to run TensorFlow with vGPU

gogasca · February 23, 2020, 8:55am

Running ESXi 6.5sp3

ESXi: NVIDIA-GRID-vSphere-6.5-440.53-440.56-442.06
Created a new VM with Ubuntu 18.04
In the VM I installed: NVIDIA-Linux-x86_64-440.56-grid.run

Inside the VM, nvidia-smi works:

root@tfe-1:~# nvidia-smi
Sun Feb 23 08:39:28 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.56       Driver Version: 440.56       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID P4-4C          On   | 00000000:02:00.0 Off |                  N/A |
| N/A   N/A    P8    N/A /  N/A |    336MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Next I try to run a Docker container inside the VM, using an image that contains CUDA/cuDNN and TensorFlow:

docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it --rm -v /tmp:/tmp nvcr.io/nvidia/tensorflow:19.12-tf1-py3

I get this warning:

================
== TensorFlow ==
================

NVIDIA Release 19.12-tf1 (build 9258376)
TensorFlow Version 1.15.0

Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2019 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use 'nvidia-docker run' to start this container; see
   https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

So the container reports that the NVIDIA driver was not detected and that GPU functionality will not be available.
When I run TensorFlow I get:

Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-02-23 08:38:41.963456: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
>>> tf.test.gpu_device_name()
2020-02-23 08:38:54.074797: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194840000 Hz
2020-02-23 08:38:54.075181: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5507dd0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-23 08:38:54.075213: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-23 08:38:54.077045: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-02-23 08:38:54.077082: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-23 08:38:54.077113: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
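
The last three log lines point at two separate things missing inside the container: the driver library libcuda.so.1 and the device node /dev/nvidia0. As a minimal sketch, both can be checked from Python inside the container. The function name check_gpu_visibility is hypothetical and not part of TensorFlow or any NVIDIA tool; the default path and library name are taken from the log above.

```python
import ctypes
import os


def check_gpu_visibility(device_path="/dev/nvidia0", driver_lib="libcuda.so.1"):
    """Report whether the GPU device node and the CUDA driver library are visible.

    Hypothetical helper: the defaults mirror the names in the TensorFlow log.
    """
    result = {"device_node": os.path.exists(device_path), "driver_lib": False}
    try:
        # CDLL raises OSError when the dynamic loader cannot find the library.
        ctypes.CDLL(driver_lib)
        result["driver_lib"] = True
    except OSError:
        pass
    return result


if __name__ == "__main__":
    print(check_gpu_visibility())
```

If both values come back False inside the container while nvidia-smi works in the VM, the driver is probably fine and the container simply was not started with GPU access.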

Troubleshooting:

- Installed nvidia-modprobe (`apt install nvidia-modprobe`) in both the VM and the container
- Inside the container:

root@5a278668fe9c:/workspace# echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
root@5a278668fe9c:/workspace# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
root@5a278668fe9c:/usr# nvidia-smi
bash: nvidia-smi: command not found

pcar · March 5, 2020, 7:05pm

I’m having the same issue. From what I have found, this is because Docker is not running with the "nvidia" runtime; it is still running with the "runc" runtime. I am having trouble figuring out which documentation is correct for installing the nvidia runtime: the various docs I’ve read seem to contradict each other about which versions of which components need to be installed.

nvidia-smi
Thu Mar  5 13:56:07 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   28C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

sudo docker info
...
 Server Version: 19.03.5
 Storage Driver: overlay2
  ...
 Runtimes: runc
 Default Runtime: runc
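
For anyone who wants to script this check, here is a rough sketch that pulls the two runtime lines out of the `docker info` text. The function name parse_docker_runtimes is hypothetical, and it assumes the plain-text layout shown above (a "Runtimes:" line with space-separated names and a "Default Runtime:" line), which may differ between Docker versions.

```python
def parse_docker_runtimes(info_text):
    """Return (list_of_runtimes, default_runtime) parsed from `docker info` text.

    Hypothetical helper; assumes the "Runtimes:" / "Default Runtime:" layout
    shown in the output above.
    """
    runtimes, default = [], None
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a "key: value" line
        if key == "Runtimes":
            runtimes = value.split()
        elif key == "Default Runtime":
            default = value.strip()
    return runtimes, default


if __name__ == "__main__":
    sample = " Server Version: 19.03.5\n Runtimes: runc\n Default Runtime: runc\n"
    runtimes, default = parse_docker_runtimes(sample)
    print("nvidia registered:", "nvidia" in runtimes, "| default:", default)
```

On a real host the text could come from `subprocess.run(["docker", "info"], capture_output=True, text=True).stdout`.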

I’ll update as I find any useful info.

pcar · March 9, 2020, 7:59pm

Got it working! I basically needed to get the nvidia runtime installed and edit the Docker daemon.json to use it:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "2"
  },
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "storage-driver": "overlay2"
}
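
If you already have a daemon.json with other settings, the same change can be applied without overwriting them. This is only a sketch: the function name add_nvidia_runtime is hypothetical, the runtime path is the one from the config above, and it assumes the file is the usual /etc/docker/daemon.json on Linux (adjust if yours lives elsewhere).

```python
import json

# Runtime entry copied from the daemon.json above.
NVIDIA_RUNTIME = {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}


def add_nvidia_runtime(config, make_default=True):
    """Return a copy of a daemon.json dict with the nvidia runtime registered.

    Hypothetical helper; existing keys and other runtimes are preserved.
    """
    updated = dict(config)
    runtimes = dict(updated.get("runtimes", {}))
    runtimes["nvidia"] = dict(NVIDIA_RUNTIME)
    updated["runtimes"] = runtimes
    if make_default:
        updated["default-runtime"] = "nvidia"
    return updated


if __name__ == "__main__":
    existing = {"log-driver": "json-file", "storage-driver": "overlay2"}
    print(json.dumps(add_nvidia_runtime(existing), indent=2))
```

The Docker daemon needs to be restarted after the file changes (for example `sudo systemctl restart docker` on systemd hosts); afterwards `docker info` should list nvidia under Runtimes.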