• Hardware (A6000)
• Network Type (Detectnet_v2)
• TLT Version 4? (Please run “tlt info --verbose” and share “docker_tag” here: this command doesn’t work for me)
• Python Version 3.7
• Docker version > 20
Hello all,
I’ve been following the TAO Toolkit ‘getting started’ guide along with the setup video. I opened the detectnet_v2 notebook from the tao_launcher_starter_kit and was working through it until I hit a problem at the TFRecords conversion step:
Creating a new directory for the output tfrecords dump.
Converting Tfrecords for kitti trainval dataset
Traceback (most recent call last):
  File "/home/cymru/miniconda3/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/home/cymru/miniconda3/lib/python3.7/site-packages/tlt/entrypoint/tao.py", line 116, in main
    args[1:]
  File "/home/cymru/miniconda3/lib/python3.7/site-packages/tlt/components/instance_handler/local_instance.py", line 296, in launch_command
    docker_logged_in(required_registry=self.task_map[task].docker_registry)
  File "/home/cymru/miniconda3/lib/python3.7/site-packages/tlt/components/instance_handler/utils.py", line 137, in docker_logged_in
    data = load_config_file(docker_config)
  File "/home/cymru/miniconda3/lib/python3.7/site-packages/tlt/components/instance_handler/utils.py", line 74, in load_config_file
    "No file found at: {}. Did you run docker login?".format(config_path)
AssertionError: Config path must be a valid unix path. No file found at: /home/cymru/.docker/config.json. Did you run docker login?
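For anyone hitting the same assertion, here is a minimal sketch of a check you can run before calling the launcher. It only looks at what the traceback complains about: whether the Docker client config file exists and whether it lists the registry. The function name docker_login_status is made up for illustration, and the assumption that a successful docker login records the registry under an "auths" key is based on the usual layout of config.json, not on the launcher’s source.

```python
import json
from pathlib import Path


def docker_login_status(config_path, registry="nvcr.io"):
    """Return a short status string for a Docker client config file.

    Hypothetical helper: it mirrors the kind of check the traceback hints at,
    but it is not the launcher's own code.
    """
    path = Path(config_path).expanduser()
    try:
        text = path.read_text()
    except FileNotFoundError:
        return f"missing: {path} (try 'docker login {registry}' as your own user)"
    except OSError as exc:
        # Typically a permissions problem, e.g. the folder is owned by root.
        return f"unreadable: {path} ({exc})"
    try:
        config = json.loads(text)
    except ValueError as exc:
        return f"unreadable: {path} is not valid JSON ({exc})"
    # Assumption: 'docker login' stores one entry per registry under "auths".
    auths = config.get("auths", {}) if isinstance(config, dict) else {}
    if registry in auths:
        return f"ok: {registry} is listed in {path}"
    return f"not logged in: {registry} is absent from {path}"


if __name__ == "__main__":
    print(docker_login_status("~/.docker/config.json"))
```

If this reports "missing" even after a login, one possible explanation is that the login was done with sudo, in which case the file would have been written under root’s home directory rather than your own.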
To provide a bit of context, I’m trying to run TAO Toolkit using the launcher CLI (listed as option 1 on this page: TAO Toolkit Getting Started | NVIDIA NGC).
Regarding the links you have sent me:
I looked at the first link regarding using sudo chown.
The message I get after running the first command is:
chown: cannot access ‘/home/cymru/.docker’: No such file or directory
The second link, I assume, covers running TAO Toolkit directly from a container. But I assume I don’t need to do that if I’m using the getting-started Jupyter notebook and installing TAO Toolkit via pip3?
UPDATE: I got rid of the message by following the Docker CE post-installation steps to run Docker without sudo, and then I ran docker login nvcr.io again. That finally generated the .docker folder in my home directory. It’s now carrying on as normal.
I don’t have any containers running when I go through the launcher CLI option in the getting-started notebook: I typed ‘docker container ls’ and I don’t see TAO Toolkit running. So I’m wondering how I can find the directory for the TAO Docker container. I know I specified it in the tao_mounts.json file, but I don’t know where to find it.
Thanks, but do I really need to run that command? I was able to download and train the model successfully without it.
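As a rough illustration of how the mounts file relates container paths to host directories, here is a small sketch. It assumes the file lives at ~/.tao_mounts.json and has a top-level "Mounts" list whose entries carry "source" (host) and "destination" (container) keys. The function names and the example paths are hypothetical.

```python
import json
from pathlib import Path


def list_mounts(mounts_file):
    """Return (host_path, container_path) pairs from a launcher mounts file.

    Assumes a top-level "Mounts" list with "source"/"destination" entries.
    """
    data = json.loads(Path(mounts_file).expanduser().read_text())
    return [(m["source"], m["destination"]) for m in data.get("Mounts", [])]


def to_host_path(container_path, mounts):
    """Translate a path as seen inside the container into the host path."""
    for source, destination in mounts:
        dest = destination.rstrip("/")
        if container_path == dest or container_path.startswith(dest + "/"):
            # Keep the part of the path below the mount point unchanged.
            return source.rstrip("/") + container_path[len(dest):]
    return None  # The path is not covered by any mount.
```

For example, with a hypothetical mount of /home/user/tao-experiments onto /workspace/tao-experiments, calling to_host_path("/workspace/tao-experiments/data/tfrecords", mounts) gives /home/user/tao-experiments/data/tfrecords. As far as I understand, the launcher starts a container for each command and the container goes away when the command finishes, which would explain why nothing shows up in docker container ls between commands. Anything written under a mounted destination should still be on the host afterwards.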
I have a different problem this morning when I came back to my computer. I’m now evaluating the trained model and I get the following output:
2023-02-14 09:39:05,194 [INFO] root: Registry: ['nvcr.io']
2023-02-14 09:39:05,233 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown")
So I typed nvidia-smi and got the following output:
Failed to initialize NVML: Driver/library version mismatch
I then ran sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi and got the following output:
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
Funnily enough this has never happened before as it was working perfectly until now.
I’ve also run dpkg -l | grep nvidia and ls -l /usr/lib/x86_64-linux-gnu/*nvidia, along with a number of other commands, to work out which version I have. Please see the log file attached.
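To make the "driver/library version mismatch" comparison concrete, here is a minimal sketch that compares the version of the loaded kernel module with the version encoded in the libnvidia-ml file names. It assumes /proc/driver/nvidia/version contains a "Kernel Module" line followed by the version, and that the user-space library is installed as libnvidia-ml.so.<version> under /usr/lib/x86_64-linux-gnu. The function names are made up for illustration.

```python
import glob
import re

_VERSION = r"(\d+\.\d+(?:\.\d+)?)"


def kernel_module_version(proc_text):
    """Extract the driver version from the text of /proc/driver/nvidia/version."""
    match = re.search(r"Kernel Module(?: for \S+)?\s+" + _VERSION, proc_text)
    return match.group(1) if match else None


def library_versions(file_names):
    """Collect versions encoded in names like libnvidia-ml.so.<version>."""
    found = set()
    for name in file_names:
        # Symlinks such as libnvidia-ml.so.1 carry no full version and are skipped.
        match = re.search(r"libnvidia-ml\.so\." + _VERSION + r"$", name)
        if match:
            found.add(match.group(1))
    return found


def diagnose(proc_text, file_names):
    """Return 'match', 'mismatch', or 'unknown' for the two version sources."""
    kernel = kernel_module_version(proc_text)
    libs = library_versions(file_names)
    if kernel is None or not libs:
        return "unknown"
    return "match" if libs == {kernel} else "mismatch"


if __name__ == "__main__":
    try:
        with open("/proc/driver/nvidia/version") as handle:
            proc_text = handle.read()
    except OSError:
        proc_text = ""  # Module not loaded, or not an NVIDIA machine.
    names = glob.glob("/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*")
    print(diagnose(proc_text, names))
```

A common cause of this message, though I can’t confirm it is what happened here, is that the user-space libraries were upgraded by a package update while the older kernel module was still loaded. In that situation a reboot usually brings the two back in line.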