Release 18.09: 384.81 was detected and compatibility mode is UNAVAILABLE

Hi all, I have a machine with a Tesla P4 and driver 384.81. It was working fine with the Docker image nvcr.io/nvidia/tensorflow:18.09-py3 until yesterday. Now the system is throwing the message “This container was built for NVIDIA Driver Release 410 or later, but version 384.81 was detected and compatibility mode is UNAVAILABLE.” and my tests are no longer working.

================
== TensorFlow ==
================

NVIDIA Release 18.09 (build 687558)

Container image Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
Copyright 2017 The TensorFlow Authors. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

ERROR: This container was built for NVIDIA Driver Release 410 or later, but
version 384.81 was detected and compatibility mode is UNAVAILABLE.

   [[CUDA Driver UNAVAILABLE (cuDevicePrimaryCtxRetain() returned 2)]]

Any recommendations?

Hello,

I’m not able to repro. This is a TensorFlow issue, which may be better addressed here: https://devtalk.nvidia.com/default/board/225/container-tensorflow/

================
== TensorFlow ==
================

NVIDIA Release 18.09 (build 687558)

Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TensorFlow.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

root@6512cf853b5b:/workspace# nvidia-smi
Tue Nov 13 18:59:22 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0    32W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0    32W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   32C    P0    32W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   28C    P0    29W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   32C    P0    31W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   29C    P0    32W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   32C    P0    30W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0    32W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Moving from TensorRT to Container: TensorFlow for better coverage.

Thanks

All of our 18.09 and later containers (TRT, TF, etc.) operate the same way in this particular respect.

In this case, you’re getting “cuDevicePrimaryCtxRetain() returned 2”, which is the following error code per CUDA Driver API :: CUDA Toolkit Documentation:

CUDA_ERROR_OUT_OF_MEMORY = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.

Does nvidia-smi show any other processes using that device at the same time? Does the issue still reproduce after you reboot?
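
If it helps, something along these lines on the host (outside the container) should show whether any process is holding memory on that GPU; the query fields are standard nvidia-smi options, and the device-file check assumes a normal driver install:

# per-GPU memory in use, checked while the failure is reproducing
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv

# compute processes nvidia-smi knows about on each GPU
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# anything holding the device files open
sudo lsof /dev/nvidia*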

Hi Cliff, there were no other processes using the device at the same time. I have downloaded the newer image and it is now working on server 1:

nvcr.io/nvidia/tensorflow:18.10-py3

I have replicated the test on my second server, which also has a Tesla P4 and driver 384.81, but it is not working:

================
== TensorFlow ==
================

NVIDIA Release 18.10 (build 785222)

Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

ERROR: This container was built for NVIDIA Driver Release 410 or later, but
       version 384.81 was detected and compatibility mode is UNAVAILABLE.

       [[CUDA Driver UNAVAILABLE (timed out during init)]]

NOTE: Detected MOFED driver 4.3-1.0.1; attempting to automatically upgrade.

(Reading database ... 16727 files and directories currently installed.)
Preparing to unpack .../ibverbs-utils_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs-dev_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs1_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libmlx5-1_41mlnx1-OFED.4.3.0.2.1.43101_amd64.deb ...
Unpacking libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) over (1.2.1mlnx1-OFED.4.0.0.1.1.40101) ...
Setting up libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) ...
Setting up ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...

NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
      Multi-node communication performance may be reduced.

Well, it is odd. I have tried again with both images, 18.09 and 18.10 (without any changes), and they are working now:

================
== TensorFlow ==
================

NVIDIA Release 18.09 (build 687558)

Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

================
== TensorFlow ==
================

NVIDIA Release 18.10 (build 785222)

Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

NOTE: Detected MOFED driver 4.3-1.0.1; attempting to automatically upgrade.

Hi all, here I am again. The container images 18.09 and 18.10 look very unstable with NVIDIA driver 384.81. I am getting the same incompatibility errors again. See below:

nvidia-docker run -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:18.09-py3

================
== TensorFlow ==
================
NVIDIA Release 18.09 (build 687558)
Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

ERROR: This container was built for NVIDIA Driver Release 410 or later, but
       version 384.81 was detected and compatibility mode is UNAVAILABLE.

       [[CUDA Driver UNAVAILABLE (cuDevicePrimaryCtxRetain() returned 2)]]

nvidia-docker run -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:18.10-py3

================
== TensorFlow ==
================
NVIDIA Release 18.10 (build 785222)
Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

ERROR: This container was built for NVIDIA Driver Release 410 or later, but
       version 384.81 was detected and compatibility mode is UNAVAILABLE.

       [[CUDA Driver UNAVAILABLE (cuDevicePrimaryCtxRetain() returned 2)]]

NOTE: Detected MOFED driver 4.3-1.0.1; attempting to automatically upgrade.

(Reading database ... 16727 files and directories currently installed.)
Preparing to unpack .../ibverbs-utils_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs-dev_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs1_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libmlx5-1_41mlnx1-OFED.4.3.0.2.1.43101_amd64.deb ...
Unpacking libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) over (1.2.1mlnx1-OFED.4.0.0.1.1.40101) ...
Setting up libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) ...
Setting up ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...

NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
      Multi-node communication performance may be reduced.

Any recommendations?

We do actually test and validate the compatibility mode; sorry to hear you’re having some challenges with it. Can you please show the output of nvidia-smi and sudo lsof /dev/nvidia* at the time you are reproducing the issue? I don’t yet have enough information to reproduce the issue, and I haven’t seen that symptom before on any of the several configs I’m using compatibility mode for.
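
For example, something along these lines, run on the host right before the failing container start, is what I have in mind; the output file names here are just placeholders:

# capture host-side state, then reproduce the failure
nvidia-smi > nvidia-smi.txt 2>&1
sudo lsof /dev/nvidia* > lsof-nvidia.txt 2>&1
nvidia-docker run -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:18.09-py3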

Incidentally, though, 384.81 was comparatively early in the R384 series (it’s the one that originally shipped alongside CUDA 9); there were several later 384.xx Tesla Recommended Driver releases. If you’re able to upgrade at all, I recommend at least going up to the latest R384; see https://nvidia.com/drivers – 384.145 is the most recent from that series (look under “beta and archived drivers”).

Even better would be to go with R410, but I realize that might be a bigger ask depending on your environment.
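
Once you have upgraded (to a later R384 or to R410), you can confirm which driver is actually loaded with a quick query; these are standard nvidia-smi query fields:

nvidia-smi --query-gpu=name,driver_version --format=csv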

Hi Cliff, I can upgrade to R410; however, my understanding was that the container image 18.10 on P4 works only with NVIDIA driver 384. See:
“Driver Requirements: Release 18.10 is based on CUDA 10, which requires NVIDIA Driver release 410.xx. However, if you are running on Tesla (Tesla V100, Tesla P4, Tesla P40, or Tesla P100), you can use NVIDIA driver release 384. For more information, see CUDA Compatibility and Upgrades.” https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/rel_18.10.html#rel_18.10
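
For reference, the pairing can be checked with a standard nvidia-smi query on the host and, inside the container, the CUDA toolkit version file (the version.txt path is an assumption about the CUDA 10.0 image layout):

# on the host: installed driver branch
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# inside the container: CUDA toolkit version the image was built against
# (path assumed for a CUDA 10.0 based image)
cat /usr/local/cuda/version.txt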

Ah, okay, sorry for the confusion – that’s a grammatical ambiguity.

It meant to say that you may use R384 for those GPUs with this release, not that you must. By all means go ahead and roll forward to R410 if you’re able to do so.

Thanks!