TensorFlow core dump, "no supported devices found for platform CUDA" (Docker nvcr.io container); after reboot, nvidia-smi can't find the driver

I’m attempting to run TensorFlow inside an NVIDIA container with GPU support, on a VM with a virtual GPU. Nobody else is using the hardware that hosts this VM.
nvidia-smi and nvcc both work, but when I call tf.Session() or tf.test.is_gpu_available() I get a core dump.

I’ve had the same issue with a number of containers, including the latest from TensorFlow and from Hugging Face. The latest TensorFlow container used to be able to compute:
tf.reduce_sum(tf.random.normal([1000, 1000]))
but this now fails as well; I only noticed while gathering data for this post.

I’m also having another issue: when I provision my VM with a P100-xxC series vGPU, nvidia-smi won’t work on the guest OS. I’m not sure whether this is related; I was intending to debug that issue after clearing this one up.

I realized that I hadn’t rebooted in a while, so before posting I rebooted to make sure that didn’t magically fix my issue. It didn’t, and now nvidia-smi doesn’t work on the guest OS either:
nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I haven’t made any software changes except pulling new containers, and the tf.reduce_sum computation was working fine and finding a gpu at that point.

I’m still going to include all the logs from when nvidia-smi was working on the guest OS.

vGPU Drivers:
root@c7fc168f6ce1:/workspace# nvidia-smi

Fri Sep 25 16:12:42 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID P100-16Q       On   | 00000000:02:02.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |   1168MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Cuda:
root@c7fc168f6ce1:/workspace# nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

Docker command:
docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it --rm nvcr.io/nvidia/tensorflow:20.08-tf1-py3

Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf
2020-09-25 16:10:37.080467: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.

>>> tf.test.is_gpu_available()
2020-09-25 16:10:48.070977: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2394370000 Hz
2020-09-25 16:10:48.071592: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5e54430 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-25 16:10:48.071622: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-25 16:10:48.074362: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-25 16:10:48.081885: W tensorflow/compiler/xla/service/platform_util.cc:190] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_NOT_SUPPORTED: operation not supported
2020-09-25 16:10:48.082063: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
Aborted (core dumped)
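
Because the fatal error above aborts the whole interpreter, it can be easier to run the check in a child process and look at how that process ended. The following is only a minimal sketch: `probe` is a hypothetical helper name (not part of TensorFlow), and it assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import subprocess
import sys


def probe(snippet, timeout=120):
    """Run a Python snippet in a child interpreter and report how it ended.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # also works on Python 3.6, as in the container
        timeout=timeout,
    )
    if result.returncode < 0:
        # On POSIX, a negative return code is the number of the signal that
        # killed the child, e.g. 6 for SIGABRT ("Aborted (core dumped)").
        status = "killed by signal %d" % -result.returncode
    elif result.returncode == 0:
        status = "ok"
    else:
        status = "exited with code %d" % result.returncode
    return status, result.stdout.strip(), result.stderr
```

Inside the container, something like `probe("import tensorflow as tf; print(tf.test.is_gpu_available())")` would then return the status and the captured log instead of taking the calling shell or notebook down with it.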

docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

2020-09-25 17:30:59.481598: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-25 17:31:01.243635: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-25 17:31:01.250599: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-25 17:31:01.251536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:02:02.0 name: GRID P100-16Q computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 16.00GiB deviceMemoryBandwidth: 681.88GiB/s
2020-09-25 17:31:01.251573: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-25 17:31:01.253281: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-25 17:31:01.255027: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-25 17:31:01.255433: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-25 17:31:01.257255: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-25 17:31:01.258261: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-25 17:31:01.262143: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-25 17:31:01.262269: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-25 17:31:01.263194: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-25 17:31:01.263765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-25 17:31:01.264080: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-09-25 17:31:01.271054: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2394370000 Hz
2020-09-25 17:31:01.271552: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5b345a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-25 17:31:01.271582: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-25 17:31:01.293640: W tensorflow/compiler/xla/service/platform_util.cc:210] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_NOT_SUPPORTED: operation not supported
2020-09-25 17:31:01.303709: I tensorflow/compiler/jit/xla_gpu_device.cc:161] Ignoring visible XLA_GPU_JIT device. Device number is 0, reason: Internal: no supported devices found for platform CUDA
2020-09-25 17:31:01.304233: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-25 17:31:01.305687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:02:02.0 name: GRID P100-16Q computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 16.00GiB deviceMemoryBandwidth: 681.88GiB/s
2020-09-25 17:31:01.305796: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-25 17:31:01.305855: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-25 17:31:01.305890: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-25 17:31:01.305922: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-25 17:31:01.305955: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-25 17:31:01.305983: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-25 17:31:01.306017: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-25 17:31:01.306205: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-25 17:31:01.307655: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-25 17:31:01.308912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-25 17:31:01.309020: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/random_ops.py", line 89, in random_normal
    shape_tensor = tensor_util.shape_tensor(shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py", line 1029, in shape_tensor
    return ops.convert_to_tensor(shape, dtype=dtype, name="shape")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1499, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py", line 338, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py", line 264, in constant
    allow_broadcast=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py", line 275, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py", line 300, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py", line 97, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 539, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
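
Most of the log above is informational; only a few lines carry the actual failure. As a rough aid for reading logs like this, here is a small sketch that keeps only warning, error, and fatal lines and pulls out any `CUDA_ERROR_*` names. The line format in `LOG_LINE` is an assumption inferred from the logs in this post, and `summarize` is a hypothetical helper name.

```python
import re

# Assumed line shape, inferred from the logs in this post:
#   "<date> <time>: <severity> <source file>:<line>] <message>"
LOG_LINE = re.compile(r"^(\S+ \S+): ([IWEF]) (\S+?):(\d+)\] (.*)$")
CUDA_ERROR = re.compile(r"CUDA_ERROR_[A-Z_]+")


def summarize(log_text):
    """Return the W/E/F log lines with any CUDA error names they mention."""
    problems = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match is None:
            continue  # tracebacks, device property lines, shell output, ...
        _, severity, source, lineno, message = match.groups()
        if severity in ("W", "E", "F"):
            problems.append({
                "severity": severity,
                "source": "%s:%s" % (source, lineno),
                "cuda_errors": CUDA_ERROR.findall(message),
                "message": message,
            })
    return problems
```

Run over either log in this post, it would surface the `cuDevicePrimaryCtxRetain` warning with `CUDA_ERROR_NOT_SUPPORTED`, which is the first real sign of trouble in both runs.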

I’m attaching the bug report, generated after the driver could no longer be found:

nvidia-bug-report.log.gz (60.0 KB)

My issues were solved by disabling the nouveau driver and reinstalling the vGPU driver:

echo 'blacklist nouveau' | sudo tee -a /etc/modprobe.d/blacklist.conf
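
For anyone who wants to confirm the same state on their own guest, here is a minimal sketch that checks whether nouveau is currently loaded and whether a blacklist entry exists. The helper names are hypothetical, and the sketch assumes a Linux guest with the usual `/proc/modules` and `/etc/modprobe.d/*.conf` locations.

```python
import glob


def loaded_modules(proc_modules_text):
    """Return the module names from text in /proc/modules format.

    The first whitespace-separated field on each line is the module name.
    """
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_blacklisted(conf_text, module="nouveau"):
    """Return True if the modprobe config text has 'blacklist <module>'."""
    for raw in conf_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, then tokenize
        if fields[:2] == ["blacklist", module]:
            return True
    return False


def check_host():
    """Inspect the running system (assumes the standard Linux paths)."""
    with open("/proc/modules") as f:
        loaded = loaded_modules(f.read())
    blacklisted = False
    for path in glob.glob("/etc/modprobe.d/*.conf"):
        with open(path) as f:
            if is_blacklisted(f.read()):
                blacklisted = True
    return {
        "nouveau_loaded": "nouveau" in loaded,
        "nvidia_loaded": "nvidia" in loaded,
        "nouveau_blacklisted": blacklisted,
    }
```

Calling `check_host()` on the guest returns a small dictionary; a result with `nouveau_loaded` set to True and `nvidia_loaded` set to False would be consistent with the situation described in this thread.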