Running TLT 3.0 on a DGX A100, driver-version error

Hi, I am trying to train detectnet_v2 with TLT 3.0 on a DGX A100 (A100-SXM4-40GB) with MIG (Multi-Instance GPU) enabled.

NVIDIA-DRIVER : 450.80.02

I am using a docker container to run TLT:

docker run -d --name tlt-leo -it --rm --runtime nvidia --gpus 'device=6:1' -p 4444:4444 -v "/home/levera/tlt3_experiments/tlt_cv_samples_v1.1.0/":"/workspace" -v /var/run/docker.sock:/var/run/docker.sock --env NVIDIA_REQUIRE_CUDA='cuda>=11.1' --env NVIDIA_DRIVER_CAPABILITIES="all" nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 bash

Then:
docker exec -it tlt-leo bash

Inside of this container I am running all experiments.

I ran all the cells in the notebook, but when I try to generate the TFRecords, it returns this error:

Converting Tfrecords for kitti trainval dataset
2021-07-13 21:22:38,973 [INFO] root: Registry: ['nvcr.io']
2021-07-13 21:22:39,459 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.1, please update your driver to a newer version, or use an earlier cuda container\\n\""": unknown")

When I did nvcc -V:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

I also tried,

docker run -d --name tlt-leo -it --rm --runtime nvidia --gpus all -p 4444:4444 -v "/home/levera/tlt3_experiments/tlt_cv_samples_v1.1.0/":"/workspace" -v /var/run/docker.sock:/var/run/docker.sock --env NVIDIA_REQUIRE_CUDA='cuda>=11.1' --env NVIDIA_DRIVER_CAPABILITIES="all" --env NVIDIA_REQUIRE_DRIVER="driver>=455" nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 bash

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: driver>=455\\n\""": unknown

So, is upgrading the DGX's NVIDIA driver to >= 455 the only solution? I don't understand why the first error mentions the CUDA version.

At the moment I can't upgrade the NVIDIA driver version. Do you have another solution?

Thanks in advance.

Hi @leo2105 ,

This is exactly the kind of thing our NVIDIA Enterprise Support team can help with! (See the "About the DGX User Forum / Note: this is not NVIDIA Enterprise Support" post for how to get them engaged.)

In the meantime, to get the container launching, I think you need to override the required CUDA version. When you added --env NVIDIA_REQUIRE_CUDA="cuda>=11.1", that was the same requirement the container already declared, which is why it fails. You can get it to launch by changing that to the CUDA version the driver on the system actually supports (CUDA 11.0). For example:

dgxuser@DGX:~$ sudo docker run -it --rm --gpus all -p 4444:4444 --env NVIDIA_REQUIRE_CUDA="cuda>=11.0" nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 bash
--2021-07-13 23:37:08--  https://ngc.nvidia.com/downloads/ngccli_reg_linux.zip
Resolving ngc.nvidia.com (ngc.nvidia.com)... 13.35.125.50, 13.35.125.106, 13.35.125.35, ...
Connecting to ngc.nvidia.com (ngc.nvidia.com)|13.35.125.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25118063 (24M) [application/zip]
Saving to: '/opt/ngccli/ngccli_reg_linux.zip'

ngccli_reg_linux.zip             100%[==========================================================>]  23.95M  95.1MB/s    in 0.3s

2021-07-13 23:37:08 (95.1 MB/s) - '/opt/ngccli/ngccli_reg_linux.zip' saved [25118063/25118063]

Archive:  /opt/ngccli/ngccli_reg_linux.zip
  inflating: /opt/ngccli/ngc
 extracting: /opt/ngccli/ngc.md5
root@b0430c2e51ac:/workspace#

(Add back in your volume mounts, etc. as needed, of course. Note that you shouldn't need to pass the runtime, socket, and other arguments to start the container unless you have other reasons for doing so.)

That container appears to already have the CUDA compatibility bits installed (see CUDA Compatibility :: GPU Deployment and Management Documentation), so I think you should be good to start running TLT.
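If you want to double-check that before launching a training job, here's a hedged sketch: the cuda-compat package normally installs its forward-compatibility libraries under /usr/local/cuda/compat, though the exact location can vary between container images, so treat the path as an assumption to verify against your image:

```shell
# Hedged check for the CUDA forward-compatibility libraries inside the
# container. /usr/local/cuda/compat is where the cuda-compat package
# usually installs them, but verify the path for your particular image.
compat_dir=/usr/local/cuda/compat

if ls "$compat_dir"/libcuda.so.* >/dev/null 2>&1; then
  echo "compat libraries present in $compat_dir"
else
  echo "no compat libraries found in $compat_dir"
fi
```

If the libraries are there, CUDA 11.1 user-space code can run on the older 450-series driver.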

On my system, inside that container CUDA 11.1 applications work:

root@b0430c2e51ac:/workspace# apt install cuda-samples-11-1
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
  cuda-samples-11-1
0 upgraded, 1 newly installed, 0 to remove and 84 not upgraded.
Need to get 69.4 MB of archives.
After this operation, 237 MB of additional disk space will be used.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  cuda-samples-11-1 11.1.105-1 [69.4 MB]
Fetched 69.4 MB in 1s (67.9 MB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package cuda-samples-11-1.
(Reading database ... 26880 files and directories currently installed.)
Preparing to unpack .../cuda-samples-11-1_11.1.105-1_amd64.deb ...
Unpacking cuda-samples-11-1 (11.1.105-1) ...
Setting up cuda-samples-11-1 (11.1.105-1) ...
root@b0430c2e51ac:/workspace# cd /usr/local/cuda/samples/1_Utilities/
root@b0430c2e51ac:/usr/local/cuda/samples/1_Utilities# ls
UnifiedMemoryPerf  bandwidthTest  deviceQuery  deviceQueryDrv  p2pBandwidthLatencyTest  topologyQuery
root@b0430c2e51ac:/usr/local/cuda/samples/1_Utilities# cd bandwidthTest/
root@b0430c2e51ac:/usr/local/cuda/samples/1_Utilities/bandwidthTest# make
/usr/local/cuda-11.1/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o bandwidthTest.o -c bandwidthTest.cu
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/usr/local/cuda-11.1/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o bandwidthTest bandwidthTest.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
mkdir -p ../../bin/x86_64/linux/release
cp bandwidthTest ../../bin/x86_64/linux/release
root@b0430c2e51ac:/usr/local/cuda/samples/1_Utilities/bandwidthTest# ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P100-SXM2-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     11.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     13.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     500.4

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
root@b0430c2e51ac:/usr/local/cuda/samples/1_Utilities/bandwidthTest#

Hope that helps!

ScottE

Hi @ScottEllis, thanks for the answer. I got the CUDA samples running successfully using this command:

docker run -d --name tlt-leo -it --rm --gpus all -p 4444:4444 -v "/home/levera/data":"/workspace/data" -v "/home/levera/tlt3_experiments/tlt_cv_samples_v1.1.0/":"/workspace" -v /var/run/docker.sock:/var/run/docker.sock --env NVIDIA_REQUIRE_CUDA="cuda>=11.0" nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 bash

I had to pass the Docker socket path, otherwise TLT 3.0 can't use Docker inside the container. But I still have the same problem when I try to generate the TFRecords:

!tlt detectnet_v2 dataset_convert \
                  -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
                  -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

Converting Tfrecords for kitti trainval dataset
2021-07-14 14:43:52,242 [INFO] root: Registry: ['nvcr.io']
2021-07-14 14:43:52,832 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.1, please update your driver to a newer version, or use an earlier cuda container\\n\""": unknown")

So you recommend that I log in (or create an account) at the Enterprise Support Portal and enter a ticket there, right?

I suspect the issue you ran into next is that the nested Docker container the TLT launcher spins up didn't inherit the NVIDIA_REQUIRE_CUDA environment variable. This workflow seems more annoying than it needs to be for you (and updating to the 460 driver would be the right answer!)... I'd recommend Enterprise Support. :-)
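In the meantime, one hedged thing you could try: some versions of the TLT launcher let you pass environment variables to the nested container via an "Envs" section in ~/.tlt_mounts.json. Whether your launcher version honors it is an assumption worth testing, and the "1000:1000" user value below is a placeholder for your own `id -u`/`id -g` output:

```json
{
    "Mounts": [
        {
            "source": "/home/levera/tlt3_experiments/tlt_cv_samples_v1.1.0",
            "destination": "/workspace"
        }
    ],
    "Envs": [
        {
            "variable": "NVIDIA_REQUIRE_CUDA",
            "value": "cuda>=11.0"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}
```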

As a DGX customer, you can login to the Enterprise Support Portal as you indicated, or simply send an email to EnterpriseSupport@nvidia.com . They will ask for your DGX serial number and your email address to get started.

ScottE

Moved thread to the TLT forum to get more eyes on it. :-)

@leo2105, per the TLT Quick Start Guide — Transfer Learning Toolkit 3.0 documentation, the container really needs the 455+ driver, which means the right answer is upgrading the driver on the DGX (460 is available...). I know you said you can't upgrade your driver, so I suspect this means you're going to be kinda stuck. Sorry.
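For the curious, the requirement check boils down to a version comparison: nvidia-container-cli maps cuda>=11.1 to a minimum driver version (the 455 series, 455.23 per the CUDA release notes), and 450.80.02 sorts below that. A rough sketch of the comparison in plain shell (an illustration, not the actual nvidia-container-cli logic):

```shell
# Minimum driver for CUDA 11.1 per the CUDA release notes (455 series);
# the installed driver version is the one from the original post.
required=455.23
installed=450.80.02   # e.g. from: nvidia-smi --query-gpu=driver_version --format=csv,noheader

# sort -V orders version strings numerically; whichever sorts first is older.
older=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)

if [ "$older" = "$installed" ] && [ "$installed" != "$required" ]; then
  msg="unsatisfied condition: driver>=$required"
else
  msg="driver requirement satisfied"
fi
echo "$msg"
```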

ScottE

Yeah, I imagined that. Actually, I have already talked to my boss about upgrading, and we created a ticket with Enterprise Support. Thank you so much for the help.

Reference for upgrading to the 455 driver:

$ apt-cache search nvidia-driver
$ sudo apt install nvidia-driver-455
$ sudo reboot

Since this is a DGX A100, we want the -server variant (i.e. nvidia-driver-460-server).

I'd recommend following the instructions at DGX OS 5.0 User Guide :: DGX Systems Documentation to upgrade to the 460 driver (assuming that ends up being an option!).