Release 18.09: 384.81 was detected and compatibility mode is UNAVAILABLE

Hi all, I have a machine with a Tesla P4 and driver 384.81. It was working fine with the Docker image nvcr.io/nvidia/tensorflow:18.09-py3 until yesterday. Now the system is throwing the message “This container was built for NVIDIA Driver Release 410 or later, but version 384.81 was detected and compatibility mode is UNAVAILABLE.” and my tests are no longer working.

================
== TensorFlow ==
================

NVIDIA Release 18.09 (build 687558)

Container image Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
Copyright 2017 The TensorFlow Authors. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

ERROR: This container was built for NVIDIA Driver Release 410 or later, but
version 384.81 was detected and compatibility mode is UNAVAILABLE.

   [[CUDA Driver UNAVAILABLE (cuDevicePrimaryCtxRetain() returned 2)]]

Any recommendations?

Hello,

I’m not able to repro. This is a TensorFlow issue, which may be better addressed here: https://devtalk.nvidia.com/default/board/225/container-tensorflow/

================
== TensorFlow ==
================

NVIDIA Release 18.09 (build 687558)

Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TensorFlow.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

root@6512cf853b5b:/workspace# nvidia-smi
Tue Nov 13 18:59:22 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0    32W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0    32W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   32C    P0    32W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   28C    P0    29W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   32C    P0    31W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   29C    P0    32W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   32C    P0    30W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0    32W / 300W |     10MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Moving from TensorRT to Container: TensorFlow for better coverage.

Thanks

All of our 18.09 and later containers (TRT, TF, etc.) operate the same way in this particular respect.

In this case, you’re getting “cuDevicePrimaryCtxRetain() returned 2”, which is the following error code per CUDA Driver API :: CUDA Toolkit Documentation:

CUDA_ERROR_OUT_OF_MEMORY = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.

Does nvidia-smi show any other processes using that device at the same time? Does the issue still reproduce after you reboot?
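
If it helps, something along these lines on the host (outside the container) should show whether any process is holding memory on that GPU; the query fields are standard nvidia-smi options, and the device-file check assumes a normal driver install:

# per-GPU memory in use, checked while the failure is reproducing
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv

# compute processes nvidia-smi knows about on each GPU
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# anything holding the device files open
sudo lsof /dev/nvidia*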

Hi Cliff, there were no other processes using the device at the same time. I have downloaded the newer image and it is now working on server 1:

nvcr.io/nvidia/tensorflow:18.10-py3

I have replicated the test on my second server, which also has a Tesla P4 and driver 384.81, but it is not working:

================
== TensorFlow ==
================

NVIDIA Release 18.10 (build 785222)

Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

ERROR: This container was built for NVIDIA Driver Release 410 or later, but
       version 384.81 was detected and compatibility mode is UNAVAILABLE.

       [[CUDA Driver UNAVAILABLE (timed out during init)]]

NOTE: Detected MOFED driver 4.3-1.0.1; attempting to automatically upgrade.

(Reading database ... 16727 files and directories currently installed.)
Preparing to unpack .../ibverbs-utils_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs-dev_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs1_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libmlx5-1_41mlnx1-OFED.4.3.0.2.1.43101_amd64.deb ...
Unpacking libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) over (1.2.1mlnx1-OFED.4.0.0.1.1.40101) ...
Setting up libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) ...
Setting up ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...

NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
      Multi-node communication performance may be reduced.

Well, it is odd. I have tried again with both images, 18.09 and 18.10 (without any changes), and they are working now:

================
== TensorFlow ==
================

NVIDIA Release 18.09 (build 687558)

Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

================
== TensorFlow ==
================

NVIDIA Release 18.10 (build 785222)

Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

NOTE: Detected MOFED driver 4.3-1.0.1; attempting to automatically upgrade.

Hi all, here I am again. The container images 18.09 and 18.10 look very unstable with NVIDIA driver 384.81. I am getting the same incompatibility errors again. See below:

nvidia-docker run -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:18.09-py3

================
== TensorFlow ==
================
NVIDIA Release 18.09 (build 687558)
Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

ERROR: This container was built for NVIDIA Driver Release 410 or later, but
       version 384.81 was detected and compatibility mode is UNAVAILABLE.

       [[CUDA Driver UNAVAILABLE (cuDevicePrimaryCtxRetain() returned 2)]]

nvidia-docker run -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:18.10-py3

================
== TensorFlow ==
================
NVIDIA Release 18.10 (build 785222)
Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

ERROR: This container was built for NVIDIA Driver Release 410 or later, but
       version 384.81 was detected and compatibility mode is UNAVAILABLE.

       [[CUDA Driver UNAVAILABLE (cuDevicePrimaryCtxRetain() returned 2)]]

NOTE: Detected MOFED driver 4.3-1.0.1; attempting to automatically upgrade.

(Reading database ... 16727 files and directories currently installed.)
Preparing to unpack .../ibverbs-utils_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs-dev_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs1_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libmlx5-1_41mlnx1-OFED.4.3.0.2.1.43101_amd64.deb ...
Unpacking libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) over (1.2.1mlnx1-OFED.4.0.0.1.1.40101) ...
Setting up libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) ...
Setting up ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...

NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
      Multi-node communication performance may be reduced.

Any recommendations?

We do actually test and validate the compatibility mode; sorry to hear you’re having some challenges with it. Can you please show the output of nvidia-smi and sudo lsof /dev/nvidia* at the time you are reproducing the issue? I don’t yet have enough information to reproduce the issue, and I haven’t seen that symptom before on any of the several configs I’m using compatibility mode for.
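
For example, something along these lines, run on the host right before the failing container start, is what I have in mind; the output file names here are just placeholders:

# capture host-side state, then reproduce the failure
nvidia-smi > nvidia-smi.txt 2>&1
sudo lsof /dev/nvidia* > lsof-nvidia.txt 2>&1
nvidia-docker run -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:18.09-py3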

Incidentally, though, 384.81 was comparatively early in the R384 series (it’s the one that originally shipped alongside CUDA 9); there were several later 384.xx Tesla Recommended Driver releases. If you’re able to upgrade at all, I recommend at least going up to the latest R384; see https://nvidia.com/drivers – 384.145 is the most recent from that series (look under “beta and archived drivers”).

Even better would be to go with R410, but I realize that might be a bigger ask depending on your environment.
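
Once you have upgraded (to a later R384 or to R410), you can confirm which driver is actually loaded with a quick query; these are standard nvidia-smi query fields:

nvidia-smi --query-gpu=name,driver_version --format=csv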

Hi Cliff, I can upgrade to R410; however, my understanding was that the container image 18.10 on P4 works only with NVIDIA driver 384. See:
“Driver Requirements: Release 18.10 is based on CUDA 10, which requires NVIDIA Driver release 410.xx. However, if you are running on Tesla (Tesla V100, Tesla P4, Tesla P40, or Tesla P100), you can use NVIDIA driver release 384. For more information, see CUDA Compatibility and Upgrades.” https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/rel_18.10.html#rel_18.10
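
For reference, the pairing can be checked with a standard nvidia-smi query on the host and, inside the container, the CUDA toolkit version file (the version.txt path is an assumption about the CUDA 10.0 image layout):

# on the host: installed driver branch
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# inside the container: CUDA toolkit version the image was built against
# (path assumed for a CUDA 10.0 based image)
cat /usr/local/cuda/version.txt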

Ah, okay, sorry for the confusion – that’s a grammatical ambiguity.

It meant to say that you may use R384 for those GPUs with this release, not that you must. By all means go ahead and roll forward to R410 if you’re able to do so.

Thanks!