• Hardware: NVIDIA TITAN Xp. The computer has an Intel® Xeon(R) CPU X5680 @ 3.33GHz × 12 with 24 GB of RAM and is running Ubuntu 22.04.5 LTS.
• Network Type: Detectnet_v2
• TAO Version (please run "tlt info --verbose" and share "docker_tag" here)
Configuration of the TAO Toolkit Instance
• I tried to run the next line in the notebook three times. The first two times it looked like it was making progress for quite some time (more than 15 minutes), but then it crashed while I wasn't watching and I got a Firefox message saying "Gah. Your tab has crashed." I then disabled the screensaver and tried again, and now I get a different message (this one doesn't take long at all):
Creating a new directory for the output tfrecords dump.
I am a complete beginner at this, and this is also my first post in this forum, so apologies if I've botched this post (it feels too long!). The machine has a fresh install of Ubuntu; installing TAO and then running this Jupyter notebook has been my first activity with it.
Ah! Thank you for the quick reply Morganh! I’m looking into that thread on expanding the Jupyter notebook now.
I assume I could start again from the beginning of the tutorial and just use the notebook as a template, cutting and pasting into a fresh terminal (with the conda environment activated) instead of running the steps through the notebook?
Thank you again Morganh! I have (with some learning along the way) gotten back to the same point in the notebook where I last had trouble, and I get the same response. Here is my command line with the output that followed:
Please open a terminal
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
Then you will log in to the docker and run, inside the docker: # detectnet_v2 dataset_convert xxx
Thank you Morganh!
Apparently things in this docker container cannot see my file structure outside it, so it can't see my images, specs file, etc. I'm trying to figure out how to pass them in. Am I on the right track, or should I be focusing on trying to increase the memory available to the browser?
Thank you again Morgan, I was able to use the -v option you suggested.
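For reference, my run command with the mount ended up along these lines (the host path is just my own layout and the /workspace destination is my choice, so adjust both for your setup):
$ docker run --runtime=nvidia -it --rm -v /home/harold/tao_tutorials:/workspace nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
The -v option maps a host directory (left of the colon) to a path inside the container (right of the colon), which is how the images and specs become visible to the tools inside the container.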
When I started the docker I got this message:
=======================
=== TAO Toolkit TF1 ===
=======================
NVIDIA Release 5.0.0-TF1 (build 52693369)
TAO Toolkit Version 5.0.0
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
ERROR: This container was built for CPUs supporting at least the AVX instruction set, but
the CPU detected was Intel(R) Xeon(R) CPU X5680 @ 3.33GHz, which does not report
support for AVX. An Illegal Instruction exception at runtime is likely to result.
See https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX .
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TAO Toolkit. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 …
I was under the impression that my Intel Xeon X5680 @ 3.33GHz would be suitable (slow is OK) for TAO 5.5. Is AVX support required for detectnet_v2 dataset_convert?
From my terminal in the docker container:
detectnet_v2 --help
gives me:
Illegal instruction (core dumped)
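(For anyone who wants to double-check their own machine, a quick generic Linux test is to grep the CPU flags; empty output means the CPU does not report AVX. This is just a general check, not something from the TAO docs:
$ grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u
On my X5680 this prints nothing, which matches the container's warning.)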
Ok, that’s unfortunate! Thank you Morgan.
I followed those threads. It sounds like I need a new computer to run this on, then. I have an NVIDIA Titan Xp. If I get a new motherboard and a CPU that supports AVX2, with at least 8 cores and more than 8 GB of RAM, are there any other potential pitfalls that would keep TAO from working that I should be aware of when looking for new equipment?
Hi Morgan, I was hopeful that a newer computer would get me past these difficulties, but I also ran into an error at this same step, so I will continue in this thread.
The CPU is now an Intel i9-9820X (10 cores), there is 32 GB of RAM, and the graphics card is the same NVIDIA TITAN Xp.
Here is my command line with the initial output (it looked promising) and then the subsequent messages. I am running this at the command line in the miniconda environment, not through a Jupyter notebook, to avoid memory issues.
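Since the environment variables in the command are normally set by the notebook, I exported them in the terminal first. The values below follow the tutorial's defaults for the in-container paths, so treat them as an example that may differ for other setups:
$ export USER_EXPERIMENT_DIR=/workspace/tao-experiments/detectnet_v2
$ export DATA_DOWNLOAD_DIR=/workspace/tao-experiments/data
$ export SPECS_DIR=/workspace/tao-experiments/detectnet_v2/specs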
(launcher) harold@TrainingComp:~/tao_tutorials$ tao model detectnet_v2 dataset_convert \
                  -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
                  -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval \
                  -r $USER_EXPERIMENT_DIR/
2024-10-28 13:09:10,148 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-28 13:09:10,273 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-10-28 13:09:10,282 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 322: The required docker doesn’t exist locally/the manifest has changed. Pulling a new docker.
2024-10-28 13:09:10,282 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 173: Pulling the required container. This may take several minutes if you’re doing this for the first time. Please wait here.
…
Pulling from repository: nvcr.io/nvidia/tao/tao-toolkit
[Download 7608715873ec] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━ 80% 0:00:01
[Download bc615fe751be] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺ 98% 0:00:01
[Download 61c7c9e56778] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 99% 0:00:01
[Extract 7608715873ec] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download 1f749c08065c] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download b777129e9daa] ━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━ 38% 0:02:07
[Extract bc615fe751be] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download 8e47ddf5daef] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
[Download 556ab1e8d85a] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
[Download b704bd04fbf5] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
[Download 3ff10bd8cf35] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download 7be515e856a0] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 99% 0:00:01
[Download d115618a5cab] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺ 98% 0:00:01
[Extract 61c7c9e56778] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download bd04c8820090] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
[Download 0384305027fa] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 99% 0:00:01
[Extract 1f749c08065c] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download 4f4fb700ef54] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
[Download df5b36ff9510] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
[Download 7e69bcc98e9c] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺ 99% 0:00:01
[Download ac9b98675d88] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
[Download 6e8541639381] ━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━ 68% -:--:--
[Download 1150aa1cb86b] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━ 80% 0:00:08
Traceback (most recent call last):
  File "/home/harold/miniconda3/envs/launcher/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/home/harold/miniconda3/envs/launcher/lib/python3.10/site-packages/nvidia_tao_cli/entrypoint/tao_launcher.py", line 134, in main
    instance.launch_command(
  File "/home/harold/miniconda3/envs/launcher/lib/python3.10/site-packages/nvidia_tao_cli/components/instance_handler/local_instance.py", line 382, in launch_command
    docker_handler.run_container(command)
  File "/home/harold/miniconda3/envs/launcher/lib/python3.10/site-packages/nvidia_tao_cli/components/docker_handler/docker_handler.py", line 325, in run_container
    self.pull()
  File "/home/harold/miniconda3/envs/launcher/lib/python3.10/site-packages/nvidia_tao_cli/components/docker_handler/docker_handler.py", line 187, in pull
    docker_pull_progress(line, progress)
  File "/home/harold/miniconda3/envs/launcher/lib/python3.10/site-packages/nvidia_tao_cli/components/docker_handler/docker_handler.py", line 66, in docker_pull_progress
    TASKS[idx] = progress.add_task(f"{idx}", total=line['progressDetail']['total'])
KeyError: 'total'
(launcher) harold@TrainingComp:~/tao_tutorials$
Hi Morgan, I thought I would go ahead and try the command docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 that you had suggested previously. It took some time but appeared to execute fine. I then re-tried the tfrecords conversion and got a new response. Here is my command line and the output (I'll wait for your input at this point):
(launcher) harold@TrainingComp:~/tao_tutorials$ tao model detectnet_v2 dataset_convert -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval -r $USER_EXPERIMENT_DIR/
2024-10-28 14:00:34,605 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-28 14:00:34,768 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-10-28 14:00:34,801 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-10-28 21:00:35.945791: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2024-10-28 21:00:35,987 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2024-10-28 21:00:37,534 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:37,571 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:37,575 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:39,167 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-infh5f3y because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2024-10-28 21:00:39,532 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:41,363 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:41,400 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:41,404 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/dataset_convert.py", line 168, in <module>
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/dataset_convert.py", line 137, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/dataset_convert.py", line 113, in main
    status_logging.StatusLogger(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/logging/logging.py", line 203, in __init__
    self.l_file = open(self.log_path, "a" if append else "w")
PermissionError: [Errno 13] Permission denied: '/status.json'
Execution status: FAIL
2024-10-28 14:00:59,567 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
(launcher) harold@TrainingComp:~/tao_tutorials$
Thank you Morgan,
I wanted to update here:
First, I removed "user": "1000:1000" from the mounts file. The run then appeared to work, but afterwards there were no files in the tfrecords directory.
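For anyone who hits the same PermissionError on /status.json: the "user" entry I removed lives in the launcher's ~/.tao_mounts.json under DockerOptions. Mine looked roughly like this before the change (the paths are my own, so treat this only as a sketch):
{
    "Mounts": [
        {
            "source": "/home/harold/tao_tutorials",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}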
Then I ran the docker container directly, as suggested earlier:
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
(with my file structure passed in via -v), and ran detectnet_v2 dataset_convert inside the docker.
I then had some issues with directories not being found, but I traced that to my configuration file not having quite the right paths (there was an extra "tao-experiments" after "/workspace"). After getting rid of those, the run was successful and I now have 20 files in kitti_trainval.
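In case it helps anyone else, the command I ran inside the container was along these lines; the paths reflect my own mount points and spec file location, so treat them as placeholders rather than exactly what you should type:
# detectnet_v2 dataset_convert -d /workspace/specs/detectnet_v2_tfrecords_kitti_trainval.txt -o /workspace/data/tfrecords/kitti_trainval/kitti_trainval -r /workspace/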
Thank you for your help!
I think you can consider this closed now, although let me know if you want me to describe anything in more detail for others who may run into the same issues.
I am now on to the next step (and hopefully no more issues!)