Tao Training Model Error

livekha · January 15, 2024, 4:34am

Please provide the following information when requesting support.

• Hardware (T4)
• Network Type (Detectnet_v2)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
Configuration of the TAO Toolkit Instance

task_group:
    model:
        dockers:
            nvidia/tao/tao-toolkit:
                5.0.0-tf2.11.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. classification_tf2
                        2. efficientdet_tf2
                5.0.0-tf1.15.5:
                    docker_registry: nvcr.io
                    tasks:
                        1. bpnet
                        2. classification_tf1
                        3. converter
                        4. detectnet_v2
                        5. dssd
                        6. efficientdet_tf1
                        7. faster_rcnn
                        8. fpenet
                        9. lprnet
                        10. mask_rcnn
                        11. multitask_classification
                        12. retinanet
                        13. ssd
                        14. unet
                        15. yolo_v3
                        16. yolo_v4
                        17. yolo_v4_tiny
                5.2.0-pyt2.1.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. action_recognition
                        2. centerpose
                        3. deformable_detr
                        4. dino
                        5. mal
                        6. ml_recog
                        7. ocdnet
                        8. ocrnet
                        9. optical_inspection
                        10. pointpillars
                        11. pose_classification
                        12. re_identification
                        13. visual_changenet
                5.2.0-pyt1.14.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. classification_pyt
                        2. segformer
    dataset:
        dockers:
            nvidia/tao/tao-toolkit:
                5.2.0-data-services:
                    docker_registry: nvcr.io
                    tasks:
                        1. augmentation
                        2. auto_label
                        3. annotations
                        4. analytics
    deploy:
        dockers:
            nvidia/tao/tao-toolkit:
                5.2.0-deploy:
                    docker_registry: nvcr.io
                    tasks:
                        1. visual_changenet
                        2. centerpose
                        3. classification_pyt
                        4. classification_tf1
                        5. classification_tf2
                        6. deformable_detr
                        7. detectnet_v2
                        8. dino
                        9. dssd
                        10. efficientdet_tf1
                        11. efficientdet_tf2
                        12. faster_rcnn
                        13. lprnet
                        14. mask_rcnn
                        15. ml_recog
                        16. multitask_classification
                        17. ocdnet
                        18. ocrnet
                        19. optical_inspection
                        20. retinanet
                        21. segformer
                        22. ssd
                        23. trtexec
                        24. unet
                        25. yolo_v3
                        26. yolo_v4
                        27. yolo_v4_tiny
format_version: 3.0
toolkit_version: 5.2.0
published_date: 12/06/2023
• Training spec file(If have, please share here)

detectnet_v2_train_resnet18_kitti.txt (3.3 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

!tao model detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                        -k $KEY \
                        -n resnet18_detector \
                        --gpus $NUM_GPUS \
                        --use_amp

2024-01-15 12:31:42,838 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-01-15 12:31:42,913 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-01-15 12:31:42,927 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-01-15 04:31:43.605353: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2024-01-15 04:31:43,657 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2024-01-15 04:31:45,345 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 04:31:45,387 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 04:31:45,391 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 04:31:47,259 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-6fagkams because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2024-01-15 04:31:47,569 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 04:31:50,010 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 04:31:50,050 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 04:31:50,054 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 04:31:51,888 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.logging.logging 197: Log file already exists at /workspace/tao-experiments/detectnet_v2/experiment_dir_unpruned/status.json
2024-01-15 04:31:51,888 [TAO Toolkit] [INFO] root 2102: Starting DetectNet_v2 Training job
2024-01-15 04:31:51,888 [TAO Toolkit] [INFO] main 817: Loading experiment spec at /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt.
2024-01-15 04:31:51,889 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.spec_handler.spec_loader 113: Merging specification from /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt
2024-01-15 04:31:51,890 [TAO Toolkit] [INFO] root 2102: 46:29 : ' dbscan_min_samples: 0.0500000007451': Couldn't parse integer: 0.0500000007451
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 1702, in _ParseAbstractInteger
    return int(text, 0)
ValueError: invalid literal for int() with base 0: '0.0500000007451'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 1652, in _ConsumeInteger
    result = ParseInteger(tokenizer.token, is_signed=is_signed, is_long=is_long)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 1674, in ParseInteger
    result = _ParseAbstractInteger(text)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 1704, in _ParseAbstractInteger
    raise ValueError('Couldn\'t parse integer: %s' % orig_text)
ValueError: Couldn't parse integer: 0.0500000007451

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1067, in <module>
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1046, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
    return_args = fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1024, in main
    run_experiment(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 821, in run_experiment
    experiment_spec = load_experiment_spec(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/spec_handler/spec_loader.py", line 136, in load_experiment_spec
    experiment_spec = load_proto(spec_path, experiment_spec, default_spec_path,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/spec_handler/spec_loader.py", line 114, in load_proto
    _load_from_file(spec_path, proto_buffer)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/spec_handler/spec_loader.py", line 100, in _load_from_file
    merge_text_proto(f.read(), pb2)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 719, in Merge
    return MergeLines(
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 793, in MergeLines
    return parser.MergeLines(lines, message)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 818, in MergeLines
    self._ParseOrMerge(lines, message)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 837, in _ParseOrMerge
    self._MergeField(tokenizer, message)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 967, in _MergeField
    merger(tokenizer, message, field)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 1042, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 967, in _MergeField
    merger(tokenizer, message, field)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 1042, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 967, in _MergeField
    merger(tokenizer, message, field)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 1042, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 967, in _MergeField
    merger(tokenizer, message, field)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 1042, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 967, in _MergeField
    merger(tokenizer, message, field)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 1076, in _MergeScalarField
    value = _ConsumeInt32(tokenizer)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 1573, in _ConsumeInt32
    return _ConsumeInteger(tokenizer, is_signed=True, is_long=False)
  File "/usr/local/lib/python3.8/dist-packages/google/protobuf/text_format.py", line 1654, in _ConsumeInteger
    raise tokenizer.ParseError(str(e))
google.protobuf.text_format.ParseError: 46:29 : ' dbscan_min_samples: 0.0500000007451': Couldn't parse integer: 0.0500000007451

Morganh · January 15, 2024, 4:52am

Please set it as below. The spec expects an integer for this field, so the float value cannot be parsed.
dbscan_min_samples: 1
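
For anyone who wants to see why the original value was rejected: the first traceback shows the parser calling int(text, 0) on the field value. Below is a minimal sketch of that check in plain Python. The helper name is_valid_int_literal is made up for illustration and is not part of TAO or protobuf.

```python
def is_valid_int_literal(text: str) -> bool:
    """Return True if `text` is accepted by int(text, 0).

    Hypothetical helper: it only mirrors the int(text, 0) call
    that appears in the traceback above.
    """
    try:
        int(text, 0)
        return True
    except ValueError:
        return False


# The value from the failing spec line is a float literal, so it is rejected.
print(is_valid_int_literal("0.0500000007451"))  # False
# The suggested replacement is a plain integer literal.
print(is_valid_int_literal("1"))  # True
```

Any value you put in an integer field of the spec has to pass this kind of check, which is why 0.05 fails and 1 works.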

livekha · January 15, 2024, 5:32am

Changed it and it worked, thanks.

livekha · January 15, 2024, 5:33am

!tao model detectnet_v2 prune \
                  -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet18_detector.tlt \
                  -o $USER_EXPERIMENT_DIR/experiment_dir_pruned/resnet18_nopool_bn_detectnet_v2_pruned.tlt \
                  -eq union \
                  -pth 0.0000052 \
                  -k $KEY

2024-01-15 13:28:25,031 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-01-15 13:28:25,108 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-01-15 13:28:25,122 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-01-15 05:28:25.824632: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2024-01-15 05:28:25,876 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2024-01-15 05:28:27,610 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 05:28:27,654 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 05:28:27,659 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 05:28:29,585 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-8yad2m4d because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2024-01-15 05:28:29,899 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 05:28:31,903 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 05:28:31,944 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2024-01-15 05:28:31,948 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/prune.py", line 46, in <module>
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/prune.py", line 30, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/magnet_prune.py", line 257, in main
    run_pruning(args)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/magnet_prune.py", line 164, in run_pruning
    final_model = model_io(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 508, in model_io
    assert os.path.exists(
AssertionError: Model not found at /workspace/tao-experiments/detectnet_v2/experiment_dir_unpruned/weights/resnet18_detector.tlt
Execution status: FAIL
2024-01-15 13:28:39,583 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Morganh · January 15, 2024, 5:48am

Again, as we discussed in your other topics, please double-check the ~/.tao_mounts.json file. To make this easier, modify your tao_mounts.json file so that "source" and "destination" are set to the same path.
Then modify the model path accordingly.
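
If it helps, here is a rough sketch of how one could check a mounts file for entries where "source" and "destination" differ. It assumes the file has a top-level "Mounts" list of objects with "source" and "destination" keys. The function name mismatched_mounts and the example paths are hypothetical and not taken from this thread.

```python
import json


def mismatched_mounts(config_text):
    """Return the mount entries whose source and destination differ."""
    config = json.loads(config_text)
    return [
        mount
        for mount in config.get("Mounts", [])
        if mount["source"] != mount["destination"]
    ]


# Hypothetical example content, for illustration only.
example = """
{
    "Mounts": [
        {"source": "/home/user/data", "destination": "/home/user/data"},
        {"source": "/home/user/experiments", "destination": "/workspace/tao-experiments"}
    ]
}
"""

# Report every mount where host and container paths are not identical.
for mount in mismatched_mounts(example):
    print(mount["source"], "->", mount["destination"])
```

To run it against a real file, read the contents of ~/.tao_mounts.json and pass the text to the function. An empty result means every host path is visible under the same path inside the container.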

livekha · January 15, 2024, 6:23am

{
    "Mounts": [
        {
            "source": "/home/glueck",
            "destination": "/home/glueck"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}

This is my tao_mounts.json file.

Morganh · January 15, 2024, 6:54am

OK, so please keep in mind that with your latest setting, every path under your local /home/glueck is mapped to the same path, /home/glueck, inside the docker.
You can use docker run to get a shell inside the docker and double-check, for example with ls /home/glueck.

Thus, the path /workspace/tao-experiments/detectnet_v2/experiment_dir_unpruned/weights/resnet18_detector.tlt needs to be changed to something like the following:
/home/glueck/detectnet_v2/experiment_dir_unpruned/weights/resnet18_detector.tlt. Please check that it exists.
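
To illustrate the mapping, here is a small sketch that translates a host path into the path seen inside the container, given a list of mounts in the same shape as the "Mounts" entries of tao_mounts.json. The function to_container_path is a made-up helper, not part of the TAO launcher, and it assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, mounts):
    """Translate a host path to the path seen inside the container.

    Returns None when no mount covers the host path.
    """
    host = PurePosixPath(host_path)
    for mount in mounts:
        source = PurePosixPath(mount["source"])
        try:
            relative = host.relative_to(source)
        except ValueError:
            continue  # this mount does not cover host_path
        return str(PurePosixPath(mount["destination"]) / relative)
    return None


# Mount taken from the tao_mounts.json posted in this thread.
mounts = [{"source": "/home/glueck", "destination": "/home/glueck"}]

# A path under /home/glueck keeps the same name inside the container.
print(to_container_path(
    "/home/glueck/detectnet_v2/experiment_dir_unpruned/weights/resnet18_detector.tlt",
    mounts,
))
# Nothing is mounted at /workspace, so this prints None.
print(to_container_path("/workspace/tao-experiments/detectnet_v2", mounts))
```

With this mount, any path that starts with /workspace/tao-experiments has no counterpart on the host, which matches the "Model not found" assertion in the prune log.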

system · January 30, 2024, 1:50am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
