I am running ESXi 6.5sp3.
GRID package installed on the ESXi host: NVIDIA-GRID-vSphere-6.5-440.53-440.56-442.06
I created a new VM with Ubuntu 18.04.
In the VM I installed the guest driver: NVIDIA-Linux-x86_64-440.56-grid.run
nvidia-smi works inside the VM:
root@tfe-1:~# nvidia-smi
Sun Feb 23 08:39:28 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.56       Driver Version: 440.56       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID P4-4C          On   | 00000000:02:00.0 Off |                  N/A |
| N/A   N/A    P8    N/A /  N/A |    336MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Next, I tried to run a Docker container inside the VM. The image contains CUDA/cuDNN and TensorFlow:
docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it --rm -v /tmp:/tmp nvcr.io/nvidia/tensorflow:19.12-tf1-py3
The container prints this warning at startup:
================
== TensorFlow ==
================
NVIDIA Release 19.12-tf1 (build 9258376)
TensorFlow Version 1.15.0
Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2019 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use 'nvidia-docker run' to start this container; see
https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .
NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.
The line that matters is: WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
When I import TensorFlow inside the container and query the GPU device name, I get:
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-02-23 08:38:41.963456: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
>>> tf.test.gpu_device_name()
2020-02-23 08:38:54.074797: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194840000 Hz
2020-02-23 08:38:54.075181: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5507dd0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-23 08:38:54.075213: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-02-23 08:38:54.077045: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-02-23 08:38:54.077082: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-23 08:38:54.077113: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
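The log above reports two separate symptoms: `libcuda.so.1` cannot be loaded, and `/dev/nvidia0` does not exist. A minimal sketch of how both could be checked directly from Python, in the VM and again in the container, is shown here. The function name `check_gpu_visibility` is hypothetical and introduced only for illustration; it uses only the standard library and does not depend on TensorFlow.

```python
import ctypes
import glob
import os


def check_gpu_visibility(device="/dev/nvidia0", library="libcuda.so.1"):
    """Report whether the GPU device node exists and the driver library loads.

    Hypothetical helper: the defaults mirror the names in the TensorFlow log.
    """
    report = {
        # Every NVIDIA device node visible in this environment.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        "device_present": os.path.exists(device),
        "library_loadable": False,
        "library_error": None,
    }
    try:
        # Resolved through the normal dynamic loader search path,
        # which includes LD_LIBRARY_PATH.
        ctypes.CDLL(library)
        report["library_loadable"] = True
    except OSError as exc:
        report["library_error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in check_gpu_visibility().items():
        print(f"{key}: {value}")
```

If both checks pass in the VM but fail in the container, that would suggest the driver library and device nodes are simply not being passed into the container, rather than a problem with the vGPU driver itself.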
Troubleshooting so far:
- Installed nvidia-modprobe (`apt install nvidia-modprobe`) in both the VM and the container
- Checked the environment inside the container:
root@5a278668fe9c:/workspace# echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
root@5a278668fe9c:/workspace# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
root@5a278668fe9c:/usr# nvidia-smi
bash: nvidia-smi: command not found
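The startup warning suggests launching the container with `nvidia-docker run`, and the `docker run` command used above does not request a GPU. A minimal sketch of the same launch with GPU access requested follows. It assumes Docker 19.03 or later and the NVIDIA Container Toolkit installed in the VM, neither of which is confirmed in this post; on older setups the `nvidia-docker run` wrapper named in the warning plays the same role. The helper `docker_run_command` is hypothetical and only assembles the argument list.

```python
import shlex

IMAGE = "nvcr.io/nvidia/tensorflow:19.12-tf1-py3"


def docker_run_command(image=IMAGE, gpus="all"):
    """Build the docker run argument list, optionally requesting GPUs.

    Assumption: the Docker installation supports the --gpus flag
    (Docker 19.03+ with the NVIDIA Container Toolkit).
    """
    cmd = ["docker", "run"]
    if gpus:
        cmd += ["--gpus", gpus]
    # Remaining options are copied from the original command.
    cmd += [
        "--shm-size=1g",
        "--ulimit", "memlock=-1",
        "--ulimit", "stack=67108864",
        "-it", "--rm",
        "-v", "/tmp:/tmp",
        image,
    ]
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable shell command.
    print(" ".join(shlex.quote(arg) for arg in docker_run_command()))
```

With the toolkit in place, the driver libraries (including `libcuda.so.1`) and utilities such as `nvidia-smi` are normally mounted into the container from the VM at start time, so they do not need to be installed inside the image.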