Hi, I am trying to train detectnet_v2 with TLT 3.0 on a DGX with A100-SXM4-40GB GPUs and MIG (Multi-Instance GPU) enabled.
NVIDIA driver: 450.80.02
I am using a docker container to run TLT:
docker run -d --name tlt-leo -it --rm --runtime nvidia --gpus 'device=6:1' -p 4444:4444 -v "/home/levera/tlt3_experiments/tlt_cv_samples_v1.1.0/":"/workspace" -v /var/run/docker.sock:/var/run/docker.sock --env NVIDIA_REQUIRE_CUDA='cuda>=11.1' --env NVIDIA_DRIVER_CAPABILITIES="all" nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 bash
Then:
docker exec -it tlt-leo bash
I am running all experiments inside this container.
I ran all the cells in the notebook, but when I try to generate the TFRecords, it returns this error:
Converting Tfrecords for kitti trainval dataset
2021-07-13 21:22:38,973 [INFO] root: Registry: ['nvcr.io']
2021-07-13 21:22:39,459 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.1, please update your driver to a newer version, or use an earlier cuda container\\n\""": unknown")
When I ran nvcc -V, I got:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
I also tried:
docker run -d --name tlt-leo -it --rm --runtime nvidia --gpus all -p 4444:4444 -v "/home/levera/tlt3_experiments/tlt_cv_samples_v1.1.0/":"/workspace" -v /var/run/docker.sock:/var/run/docker.sock --env NVIDIA_REQUIRE_CUDA='cuda>=11.1' --env NVIDIA_DRIVER_CAPABILITIES="all" --env NVIDIA_REQUIRE_DRIVER="driver>=455" nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 bash
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: driver>=455\\n\""": unknown
So is upgrading the DGX's NVIDIA driver to >=455 the only solution? I don't understand why the first error mentions the CUDA version.
I can't upgrade the NVIDIA driver at the moment. Is there another solution?
Thanks in advance.