Detectnet_v2 notebook stuck at tfrecords conversion step

• Hardware: NVIDIA TITAN Xp. The computer has an Intel® Xeon(R) CPU X5680 @ 3.33GHz × 12 with 24 GB of RAM and is running Ubuntu 22.04.5 LTS.
• Network Type: Detectnet_v2
• TAO Version (Please run "tlt info --verbose" and share "docker_tag" here)
Configuration of the TAO Toolkit Instance

task_group:
model:
dockers:
nvidia/tao/tao-toolkit:
5.5.0-pyt:
docker_registry: nvcr.io
tasks:
1. action_recognition
2. centerpose
3. visual_changenet
4. deformable_detr
5. dino
6. grounding_dino
7. mask_grounding_dino
8. mask2former
9. mal
10. ml_recog
11. ocdnet
12. ocrnet
13. optical_inspection
14. pointpillars
15. pose_classification
16. re_identification
17. classification_pyt
18. segformer
19. bevfusion
5.0.0-tf1.15.5:
docker_registry: nvcr.io
tasks:
1. bpnet
2. classification_tf1
3. converter
4. detectnet_v2
5. dssd
6. efficientdet_tf1
7. faster_rcnn
8. fpenet
9. lprnet
10. mask_rcnn
11. multitask_classification
12. retinanet
13. ssd
14. unet
15. yolo_v3
16. yolo_v4
17. yolo_v4_tiny
5.5.0-tf2:
docker_registry: nvcr.io
tasks:
1. classification_tf2
2. efficientdet_tf2
dataset:
dockers:
nvidia/tao/tao-toolkit:
5.5.0-data-services:
docker_registry: nvcr.io
tasks:
1. augmentation
2. auto_label
3. annotations
4. analytics
deploy:
dockers:
nvidia/tao/tao-toolkit:
5.5.0-deploy:
docker_registry: nvcr.io
tasks:
1. visual_changenet
2. centerpose
3. classification_pyt
4. classification_tf1
5. classification_tf2
6. deformable_detr
7. detectnet_v2
8. dino
9. dssd
10. efficientdet_tf1
11. efficientdet_tf2
12. faster_rcnn
13. grounding_dino
14. mask_grounding_dino
15. mask2former
16. lprnet
17. mask_rcnn
18. ml_recog
19. multitask_classification
20. ocdnet
21. ocrnet
22. optical_inspection
23. retinanet
24. segformer
25. ssd
26. trtexec
27. unet
28. yolo_v3
29. yolo_v4
30. yolo_v4_tiny
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024
• Training spec file: detectnet_v2_tfrecords_kitti_trainval.txt (334 Bytes)

TFRecords conversion spec file for KITTI training:
kitti_config {
  root_directory_path: "/home/harold/workspace/tao-experiments/data/training"
  image_dir_name: "image_2"
  label_dir_name: "label_2"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 14
  num_shards: 10
}
image_directory_path: "/home/harold/workspace/tao-experiments/data/training"
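For readers following along: the spec above assumes the standard KITTI directory layout under `root_directory_path`. A throwaway sketch of that layout (file names here are hypothetical, just to show the pairing):

```shell
# Build a throwaway KITTI-style tree matching the spec above:
# image_2/ holds the PNG frames, label_2/ holds one KITTI .txt per image.
root=$(mktemp -d)
mkdir -p "$root/image_2" "$root/label_2"
touch "$root/image_2/000000.png" "$root/image_2/000001.png"
touch "$root/label_2/000000.txt" "$root/label_2/000001.txt"
# dataset_convert pairs images and labels by file stem, so counts must match
imgs=$(ls "$root/image_2" | wc -l)
lbls=$(ls "$root/label_2" | wc -l)
echo "images=$imgs labels=$lbls"
```

A mismatch between the two counts (or a wrong `image_extension`) is a common reason for the conversion to fail before any AVX or memory issue comes into play.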

• I tried to run the next cell in the notebook three times. The first two times it looked like it was making progress for quite a while (more than 15 minutes) but then crashed while I wasn't watching, and I got a Firefox message saying "Gah. Your tab just crashed." I then disabled the screensaver and tried again, and now I get a different message (this one doesn't take long at all):

Creating a new directory for the output tfrecords dump.

print("Converting Tfrecords for kitti trainval dataset")
!mkdir -p $LOCAL_DATA_DIR/tfrecords && rm -rf $LOCAL_DATA_DIR/tfrecords/*
!tao model detectnet_v2 dataset_convert \
    -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
    -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval \
    -r $USER_EXPERIMENT_DIR/

# Output from the above cell:
Converting Tfrecords for kitti trainval dataset
2024-10-09 11:35:27,059 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-09 11:35:27,166 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-10-09 11:35:27,187 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-10-09 11:35:28,475 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.


Here is the output of nvidia-smi:
[nvidia-smi screenshot omitted]

I am a complete beginner at this, and this is also my first post on this forum, so apologies if I've botched this post (it feels too long!). The machine has a fresh install of Ubuntu; installing TAO and then running this Jupyter notebook has been my first activity with it.

I am afraid docker pull exhausts the browser’s memory and forces a reload.

Maybe you can refer to python - How to increase Jupyter notebook Memory limit? - Stack Overflow to increase Jupyter notebook memory.

You can also open a terminal and docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 in advance.

Ah! Thank you for the quick reply Morganh! I’m looking into that thread on expanding the Jupyter notebook now.

I assume that I could start again from the beginning of the tutorial and just use the notebook as a template and cut/paste into a fresh terminal (with conda launched) instead of running the steps through the notebook?

Yes, you can run in the terminal by leveraging the notebook’s commands.

Thank you again Morganh! I have (with some learning along the way) gotten back to the same point in the workbook where I last had trouble and I get the same response. Here’s my command line with the output that followed:

(launcher) harold@Ubuntu-Training-Computer:~/tao_tutorials/notebooks/tao_launcher_starter_kit/detectnet_v2$ tao model detectnet_v2 dataset_convert -d $LOCAL_SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt -o $LOCAL_DATA_DIR/tfrecords/kitti_trainval/kitti_trainval -r $LOCAL_EXPERIMENT_DIR
2024-10-11 22:34:30,937 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-11 22:34:31,042 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-10-11 22:34:31,062 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-10-11 22:34:32,330 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

I did try the suggestion of:
docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
first as well, but that made no difference. It returned a message that said:
Status: Image is up to date for nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

Any idea what the issue might be?
Thank you, for looking at this!

Please open a terminal
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
Then you will be logged in to the docker container and can run commands inside it.
# detectnet_v2 dataset_convert xxx

Thank you Morganh!
Apparently things in this docker container cannot see my file structure outside it, so it can't see my images or spec files, etc. I'm trying to figure out how to pass them in. Am I on the right track, or should I be focusing on trying to increase the memory of the browser?

You can use -v /your/localpath:/docker/path. For more info, please check the docker usage documentation.
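Putting that together with the earlier `docker run` suggestion, a sketch for the paths used in this thread looks like the following (echoed as a dry run so nothing is actually started; the host path is the one from the spec file above and may differ on your machine):

```shell
# Dry run: print a docker command that mounts the host experiment
# directory into the container (host path assumed from earlier in this
# thread; the destination path inside the container is illustrative).
LOCAL_PROJECT_DIR=/home/harold/workspace/tao-experiments
echo docker run --runtime=nvidia -it --rm \
    -v "$LOCAL_PROJECT_DIR":/workspace/tao-experiments \
    nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
```

Remove the leading `echo` to actually start the container; the spec file's `root_directory_path` must then refer to the path as seen *inside* the container.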

Thank you again Morgan, I was able to use the -v you suggested.
When I started the docker I got this message:

=======================
=== TAO Toolkit TF1 ===
=======================

NVIDIA Release 5.0.0-TF1 (build 52693369)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:

ERROR: This container was built for CPUs supporting at least the AVX instruction set, but
   the CPU detected was Intel(R) Xeon(R) CPU           X5680  @ 3.33GHz, which does not report
   support for AVX.  An Illegal Instrution exception at runtime is likely to result.
   See https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX .

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TAO Toolkit. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

I was under the impression that my Intel Xeon X5680 @ 3.33GHz would be suitable (slow is OK) for TAO 5.5. Is this AVX support required for detectnet_v2 dataset_convert?

From my terminal in the docker container:
detectnet_v2 --help
gives me:
Illegal instruction (core dumped)

tao --help
gives me
bash: tao: command not found

I suggest using a CPU which supports AVX2.
Refer to Core dump Illegal Instruction on detectnet_v2 example - #15 by Morganh and python - Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2 - Stack Overflow.
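Before buying hardware, AVX/AVX2 support can be checked directly on any Linux box from /proc/cpuinfo, whose flags line lists the instruction-set extensions the CPU advertises:

```shell
# Report whether this machine's CPU advertises AVX and AVX2.
# (The container banner above indicates the TF1-based TAO image needs
# at least AVX; the reply above recommends AVX2.)
for ext in avx avx2; do
    if grep -qm1 "\b$ext\b" /proc/cpuinfo; then
        echo "$ext: yes"
    else
        echo "$ext: no"
    fi
done
```

On the Xeon X5680 (Westmere) both lines come back "no", which matches the Illegal Instruction crash seen above.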

Ok, that’s unfortunate! Thank you Morgan.
I followed those threads. It sounds like I need a new computer to run this on then. I have a NVIDIA Titan Xp. If I get a new motherboard and a CPU that supports AVX2 with at least 8 cores and more than 8GB of RAM are there any other potential pitfalls that would keep TAO from working that I should be aware of when looking for new equipment?

Please refer to Getting Started - NVIDIA Docs.

Hi Morgan, I was hopeful that a newer computer would get me past these difficulties, but I also ran into an error at this same step, so I will continue in this thread.
The CPU is now an Intel i9-9820X (10 cores), there is 32 GB of RAM, and the graphics card is the same NVIDIA TITAN Xp.

Here is my command line with the initial output (it looked promising) and then subsequent messages. I am running this at the command line in the miniconda environment, not through a Jupyter notebook, to avoid memory issues:
(launcher) harold@TrainingComp:~/tao_tutorials$ tao model detectnet_v2 dataset_convert \
    -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
    -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval \
    -r $USER_EXPERIMENT_DIR/
2024-10-28 13:09:10,148 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-28 13:09:10,273 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-10-28 13:09:10,282 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 322: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
2024-10-28 13:09:10,282 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 173: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.

Pulling from repository: nvcr.io/nvidia/tao/tao-toolkit
[Download 7608715873ec] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━ 80% 0:00:01
[Download bc615fe751be] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺ 98% 0:00:01
[Download 61c7c9e56778] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 99% 0:00:01
[Extract 7608715873ec] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download 1f749c08065c] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download b777129e9daa] ━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━ 38% 0:02:07
[Extract bc615fe751be] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download 8e47ddf5daef] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:–:–
[Download 556ab1e8d85a] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:–:–
[Download b704bd04fbf5] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:–:–
[Download 3ff10bd8cf35] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download 7be515e856a0] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 99% 0:00:01
[Download d115618a5cab] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺ 98% 0:00:01
[Extract 61c7c9e56778] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download bd04c8820090] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:–:–
[Download 0384305027fa] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 99% 0:00:01
[Extract 1f749c08065c] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[Download 4f4fb700ef54] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:–:–
[Download df5b36ff9510] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:–:–
[Download 7e69bcc98e9c] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺ 99% 0:00:01
[Download ac9b98675d88] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:–:–
[Download 6e8541639381] ━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━ 68% -:–:–
[Download 1150aa1cb86b] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━ 80% 0:00:08
Traceback (most recent call last):
  File "/home/harold/miniconda3/envs/launcher/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/home/harold/miniconda3/envs/launcher/lib/python3.10/site-packages/nvidia_tao_cli/entrypoint/tao_launcher.py", line 134, in main
    instance.launch_command(
  File "/home/harold/miniconda3/envs/launcher/lib/python3.10/site-packages/nvidia_tao_cli/components/instance_handler/local_instance.py", line 382, in launch_command
    docker_handler.run_container(command)
  File "/home/harold/miniconda3/envs/launcher/lib/python3.10/site-packages/nvidia_tao_cli/components/docker_handler/docker_handler.py", line 325, in run_container
    self.pull()
  File "/home/harold/miniconda3/envs/launcher/lib/python3.10/site-packages/nvidia_tao_cli/components/docker_handler/docker_handler.py", line 187, in pull
    docker_pull_progress(line, progress)
  File "/home/harold/miniconda3/envs/launcher/lib/python3.10/site-packages/nvidia_tao_cli/components/docker_handler/docker_handler.py", line 66, in docker_pull_progress
    TASKS[idx] = progress.add_task(f"{idx}", total=line['progressDetail']['total'])
KeyError: 'total'
(launcher) harold@TrainingComp:~/tao_tutorials$

Hi Morgan, I thought I would go ahead and try the command: docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
that you had suggested previously. It took some time but appeared to execute fine. I then re-tried the tfrecords convert and got a new response, here is my command line and the output. (I’ll wait for your input at this point):
(launcher) harold@TrainingComp:~/tao_tutorials$ tao model detectnet_v2 dataset_convert -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval -r $USER_EXPERIMENT_DIR/
2024-10-28 14:00:34,605 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-28 14:00:34,768 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-10-28 14:00:34,801 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-10-28 21:00:35.945791: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2024-10-28 21:00:35,987 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2024-10-28 21:00:37,534 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:37,571 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:37,575 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:39,167 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-infh5f3y because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2024-10-28 21:00:39,532 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:41,363 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:41,400 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2024-10-28 21:00:41,404 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/dataset_convert.py", line 168, in <module>
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/dataset_convert.py", line 137, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/dataset_convert.py", line 113, in main
    status_logging.StatusLogger(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/logging/logging.py", line 203, in __init__
    self.l_file = open(self.log_path, "a" if append else "w")
PermissionError: [Errno 13] Permission denied: '/status.json'
Execution status: FAIL
2024-10-28 14:00:59,567 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
(launcher) harold@TrainingComp:~/tao_tutorials$

Please refer to Tao model detectnet_v2 dataset_convert Error : permission denied : status.json / Try to Run tao detectnet_v2 command inside of docker and fork tao toolkit tf - #28 by Morganh.
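For context, the launcher reads its volume mappings and docker options from ~/.tao_mounts.json; the "user":"1000:1000" entry discussed in that thread lives under "DockerOptions". A minimal sketch of such a file follows (written to /tmp here so the snippet is safe to run; the source paths are the ones used earlier in this thread and the destination paths are illustrative):

```shell
# Sketch of a ~/.tao_mounts.json for the TAO launcher. "Mounts" maps host
# directories into the container; "DockerOptions" carries extra docker
# run options such as the user mapping mentioned above.
cat > /tmp/tao_mounts_example.json <<'EOF'
{
    "Mounts": [
        {
            "source": "/home/harold/workspace/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}
EOF
# Validate that the file parses as JSON before the launcher tries to use it
python3 -m json.tool /tmp/tao_mounts_example.json > /dev/null && echo "valid JSON"
```

Removing the "user" entry makes the container run as root, which sidesteps the /status.json permission error at the cost of root-owned output files on the host.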

Thank you Morgan,
I wanted to update here:
First, I removed "user":"1000:1000" from the mounts file. It appeared to work at that point, but afterwards there were no files in the tfrecords directory.
Then I started the container with
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
and ran
# detectnet_v2 dataset_convert xxx
inside it, passing my file structure in with -v.

I then had some issues with directories not being found, but traced that to my configuration file not having quite the right paths (there was an extra "tao-experiments" after the "/workspace"). After getting rid of those, the run was successful and now I have 20 files in kitti_trainval.
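That file count appears consistent with the spec from the start of the thread: dataset_convert writes num_shards files per partition, so with the values used here:

```shell
# Values from the kitti_config spec earlier in the thread:
# 10 shards per partition x 2 partitions (train + val).
num_shards=10
num_partitions=2
echo $(( num_shards * num_partitions ))   # total tfrecord shard files
```

which matches the 20 files observed in kitti_trainval.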

Thank you for your help!
I think you can consider this closed now, although let me know if you want me to describe anything better for those out there that may encounter the same issues.

I am now on to the next step (and hopefully no more issues!)

Thanks for the info. Glad to know it is working now.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.