Hi all, I have a machine with a Tesla P4 and driver 384.81. It was working fine with the Docker image nvcr.io/nvidia/tensorflow:18.09-py3 until yesterday; now the container prints "This container was built for NVIDIA Driver Release 410 or later, but version 384.81 was detected and compatibility mode is UNAVAILABLE." and my tests no longer run:
================
== TensorFlow ==
================
NVIDIA Release 18.09 (build 687558)
Container image Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
Copyright 2017 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: This container was built for NVIDIA Driver Release 410 or later, but
version 384.81 was detected and compatibility mode is UNAVAILABLE.
[[CUDA Driver UNAVAILABLE (cuDevicePrimaryCtxRetain() returned 2)]]
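For reference, the startup check seems to boil down to a major-version comparison between the host driver and the release the container was built for. A rough sketch of that logic (the helper name is my own, not the actual entrypoint code):

```shell
# Rough sketch of the entrypoint's driver check (not the actual NVIDIA code).
# needs_compat REQUIRED_MAJOR DRIVER_VERSION -> prints "yes" if the driver is
# older than the release the container was built against.
needs_compat() {
    driver_major=${2%%.*}              # e.g. "384.81" -> "384"
    if [ "$driver_major" -lt "$1" ]; then
        echo yes
    else
        echo no
    fi
}

# On a live host the version would come from:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
needs_compat 410 384.81    # -> yes: compatibility mode (or an upgrade) is needed
```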
Hi Cliff, there were no other processes using the device at the same time. I have downloaded the newer image, and it is now working on server 1:
nvcr.io/nvidia/tensorflow:18.10-py3
I have replicated the test on my second server, which also has a Tesla P4 and driver 384.81, but there it is not working:
================
== TensorFlow ==
================
NVIDIA Release 18.10 (build 785222)
Container image Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
Copyright 2017 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: This container was built for NVIDIA Driver Release 410 or later, but
version 384.81 was detected and compatibility mode is UNAVAILABLE.
[[CUDA Driver UNAVAILABLE (timed out during init)]]
NOTE: Detected MOFED driver 4.3-1.0.1; attempting to automatically upgrade.
(Reading database ... 16727 files and directories currently installed.)
Preparing to unpack .../ibverbs-utils_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs-dev_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs1_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libmlx5-1_41mlnx1-OFED.4.3.0.2.1.43101_amd64.deb ...
Unpacking libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) over (1.2.1mlnx1-OFED.4.0.0.1.1.40101) ...
Setting up libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) ...
Setting up ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
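Side note on that last warning: whether the nv_peer_mem module (GPUDirect RDMA) is loaded can be checked on the host. A small sketch, with the helper reading `lsmod`-style text on stdin so the check itself has no hardware dependency:

```shell
# Check for the nv_peer_mem kernel module the container warned about.
# The helper reads `lsmod`-style output on stdin.
nv_peer_mem_loaded() {
    grep -q '^nv_peer_mem[[:space:]]'
}

if lsmod 2>/dev/null | nv_peer_mem_loaded; then
    echo "nv_peer_mem loaded"
else
    echo "nv_peer_mem not loaded: multi-node communication may be slower"
fi
```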
Well, it is odd. I have tried again with both images, 18.09 and 18.10 (without any changes), and both are working now:
================
== TensorFlow ==
================
NVIDIA Release 18.09 (build 687558)
Container image Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
Copyright 2017 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
NOTE: Legacy NVIDIA Driver detected. Compatibility mode ENABLED.
================
== TensorFlow ==
================
NVIDIA Release 18.10 (build 785222)
Container image Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
Copyright 2017 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
NOTE: Legacy NVIDIA Driver detected. Compatibility mode ENABLED.
NOTE: Detected MOFED driver 4.3-1.0.1; attempting to automatically upgrade.
Hi all, here I am again. The container images 18.09 and 18.10 look very unstable with NVIDIA driver 384.81: I am getting the same incompatibility errors again. See below:
================
== TensorFlow ==
================
NVIDIA Release 18.09 (build 687558)
Container image Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
Copyright 2017 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: This container was built for NVIDIA Driver Release 410 or later, but
version 384.81 was detected and compatibility mode is UNAVAILABLE.
[[CUDA Driver UNAVAILABLE (cuDevicePrimaryCtxRetain() returned 2)]]
================
== TensorFlow ==
================
NVIDIA Release 18.10 (build 785222)
Container image Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
Copyright 2017 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: This container was built for NVIDIA Driver Release 410 or later, but
version 384.81 was detected and compatibility mode is UNAVAILABLE.
[[CUDA Driver UNAVAILABLE (cuDevicePrimaryCtxRetain() returned 2)]]
NOTE: Detected MOFED driver 4.3-1.0.1; attempting to automatically upgrade.
(Reading database ... 16727 files and directories currently installed.)
Preparing to unpack .../ibverbs-utils_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs-dev_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libibverbs1_41mlnx1-OFED.4.3.0.1.8.43101_amd64.deb ...
Unpacking libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) over (1.2.1mlnx1-OFED.4.0.0.1.3.40101) ...
Preparing to unpack .../libmlx5-1_41mlnx1-OFED.4.3.0.2.1.43101_amd64.deb ...
Unpacking libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) over (1.2.1mlnx1-OFED.4.0.0.1.1.40101) ...
Setting up libibverbs1 (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libmlx5-1 (41mlnx1-OFED.4.3.0.2.1.43101) ...
Setting up ibverbs-utils (41mlnx1-OFED.4.3.0.1.8.43101) ...
Setting up libibverbs-dev (41mlnx1-OFED.4.3.0.1.8.43101) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
We do actually test and validate the compatibility mode; sorry to hear you’re having some challenges with it. Can you please show the output of nvidia-smi and sudo lsof /dev/nvidia* at the time you are reproducing the issue? I don’t yet have enough information to reproduce the issue, and I haven’t seen that symptom before on any of the several configs I’m using compatibility mode for.
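Something along these lines, run on the host while the failure is reproducing, would capture what I'm after (the dmesg grep is a bonus; the guards just keep the script harmless on machines without the tools):

```shell
# Sketch: gather driver-side diagnostics on the host while the error
# reproduces. Each step is skipped if the relevant tool or device is absent.
collect_gpu_diagnostics() {
    command -v nvidia-smi >/dev/null && nvidia-smi
    ls /dev/nvidia* >/dev/null 2>&1 && sudo lsof /dev/nvidia*
    command -v dmesg >/dev/null && { dmesg | grep -i nvidia | tail -n 20; }
    return 0
}

collect_gpu_diagnostics
```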
Incidentally, though, 384.81 was comparatively early in the R384 series (it’s the one that originally shipped alongside CUDA 9); there were several later 384.xx Tesla Recommended Driver releases. If you’re able to upgrade at all, I recommend at least going up to the latest R384; see https://nvidia.com/drivers – 384.145 is the most recent from that series (look under “beta and archived drivers”).
Even better would be to go with R410, but I realize that might be a bigger ask depending on your environment.
Hi Cliff, I can upgrade to R410; however, my understanding was that the container image 18.10 on P4 works only with the nvidia driver 384. See:
“Driver Requirements: Release 18.10 is based on CUDA 10, which requires NVIDIA Driver release 410.xx. However, if you are running on Tesla (Tesla V100, Tesla P4, Tesla P40, or Tesla P100), you can use NVIDIA driver release 384. For more information, see CUDA Compatibility and Upgrades.” https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/rel_18.10.html#rel_18.10
Ah, okay, sorry for the confusion – that’s a grammatical ambiguity.
It meant to say that you may use R384 for those GPUs with this release, not that you must. By all means go ahead and roll forward to R410 if you’re able to do so.
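One last tip: compatibility mode works by the container putting forward-compat CUDA user-mode libraries on the loader path at startup. If it ever flips back to UNAVAILABLE, it's worth confirming those libraries are actually present in the image. The path below is my understanding of the 18.xx image layout, so treat it as an assumption:

```shell
# Inside the container: check for the forward-compat CUDA libraries that
# compatibility mode depends on. The directory name is an assumption about
# the 18.xx image layout, not a documented contract.
compat_dir=/usr/local/cuda/compat
if [ -d "$compat_dir" ]; then
    echo "compat libraries found:"
    ls "$compat_dir"
else
    echo "no compat libraries at $compat_dir"
fi
```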