Unable to export QAT YOLOv3 in INT8

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
RTX 4090

• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
yolo_v3
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here)
/home/ilias/anaconda3/envs/launcher/lib/python3.6/site-packages/tlt/__init__.py:20: DeprecationWarning:
The nvidia-tlt package will be deprecated soon. Going forward please migrate to using the nvidia-tao package.

warnings.warn(message, DeprecationWarning)
Configuration of the TAO Toolkit Instance

dockers:
nvidia/tao/tao-toolkit-tf:
v3.21.11-tf1.15.5-py3:
docker_registry: nvcr.io
tasks:
1. augment
2. bpnet
3. classification
4. dssd
5. emotionnet
6. efficientdet
7. fpenet
8. gazenet
9. gesturenet
10. heartratenet
11. lprnet
12. mask_rcnn
13. multitask_classification
14. retinanet
15. ssd
16. unet
17. yolo_v3
18. yolo_v4
19. yolo_v4_tiny
20. converter
v3.21.11-tf1.15.4-py3:
docker_registry: nvcr.io
tasks:
1. detectnet_v2
2. faster_rcnn
nvidia/tao/tao-toolkit-pyt:
v3.21.11-py3:
docker_registry: nvcr.io
tasks:
1. speech_to_text
2. speech_to_text_citrinet
3. text_classification
4. question_answering
5. token_classification
6. intent_slot_classification
7. punctuation_and_capitalization
8. action_recognition
v3.22.02-py3:
docker_registry: nvcr.io
tasks:
1. spectro_gen
2. vocoder
nvidia/tao/tao-toolkit-lm:
v3.21.08-py3:
docker_registry: nvcr.io
tasks:
1. n_gram
format_version: 2.0
toolkit_version: 3.22.02
published_date: 02/28/2022
• Training spec file (If have, please share here)
experiment spec file
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I run the command :
!tao yolo_v3 export
-e $SPECS_DIR/experiment_spec_exp.json
-m $USER_EXPERIMENT_DIR/experiment_dir_retrain_qat3/weights/yolov3_resnet18_epoch_080.tlt
-o $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector_qat2.etlt
-k $KEY
# --cal_image_dir /workspace/tao-experiments/try-6/train/
# --cal_data_file /$USER_EXPERIMENT_DIR/experiment_dir_final/calibration_qat.tensorfile
--data_type int8
# --batch_size 8
# --max_batch_size 64
--cal_json_file $USER_EXPERIMENT_DIR/experiment_dir_final/calibration_qat.json
# --verbose

Whether the "#" lines are commented out or not, I get lots of different errors, but in this specific configuration I get:

/home/ilias/anaconda3/envs/launcher/lib/python3.6/site-packages/tlt/__init__.py:20: DeprecationWarning:
The nvidia-tlt package will be deprecated soon. Going forward please migrate to using the nvidia-tao package.

warnings.warn(message, DeprecationWarning)
2023-04-21 14:18:35,036 [INFO] root: Registry: ['nvcr.io']
2023-04-21 14:18:35,071 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-ydv8_bf6 because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2023-04-21 12:18:38,083 [INFO] root: Building exporter object.
2023-04-21 12:18:39,692 [INFO] root: Exporting the model.
2023-04-21 12:18:39,692 [INFO] root: Using input nodes: ['Input']
2023-04-21 12:18:39,692 [INFO] root: Using output nodes: ['BatchedNMS']
2023-04-21 12:18:39,692 [INFO] iva.common.export.keras_exporter: Using input nodes: ['Input']
2023-04-21 12:18:39,692 [INFO] iva.common.export.keras_exporter: Using output nodes: ['BatchedNMS']
The ONNX operator number change on the optimization: 379 → 173
2023-04-21 12:18:54,369 [INFO] keras2onnx: The ONNX operator number change on the optimization: 379 → 173
[TensorRT] ERROR: 1: [caskUtils.cpp::trtSmToCask::114] Error Code 1: Internal Error (Unsupported SM: 0x809)
2023-04-21 12:18:56,130 [ERROR] modulus.export._tensorrt: Failed to create engine
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 869, in __init__
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/scripts/export.py", line 12, in
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 265, in launch_export
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 455, in export
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 877, in __init__
AssertionError: Parsing failed on line 869 in statement
2023-04-21 14:18:57,191 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

This generates the .etlt file but not the calibration JSON file needed to make it work with DeepStream. What am I doing wrong? Is TAO compatible with the RTX 4090?

Thank you for your help !
Best regards,
Ilias.

Please update to the latest TAO version. Refer to Migrating from older TLT to TAO Toolkit - NVIDIA Docs

Or use the latest docker directly: TAO Toolkit | NVIDIA NGC. For yolov3, use nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5

Thank you for your answer !
I didn't know I had a version problem, so I switched to running the proper docker container directly.
So I ran:

docker run -it --rm --gpus all -v .:/workspace nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 \
    yolo_v3 export \
    -e specs/experiment_spec_exp.json \
    -m yolo_v3/experiment_dir_retrain_qat3/weights/yolov3_resnet18_epoch_080.tlt \
    -o yolo_v3/experiment_dir_final/resnet18_detector_qat2.etlt \
    --data_type int8 \
    -k tlt_encode \
    --cal_json_file yolo_v3/experiment_dir_final/calibration_qat.json

and got the logs and errors:

=== TAO Toolkit TensorFlow ===

NVIDIA Release 4.0.1-TensorFlow (build )
TAO Toolkit Version 4.0.1

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TAO Toolkit. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 …

Using TensorFlow backend.
2023-04-24 09:23:37.071822: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
2023-04-24 09:23:41,726 [INFO] iva.common.export.keras_exporter: Using input nodes: ['Input']
2023-04-24 09:23:41,726 [INFO] iva.common.export.keras_exporter: Using output nodes: ['BatchedNMS']
The ONNX operator number change on the optimization: 379 → 173
2023-04-24 09:23:58,712 [INFO] keras2onnx: The ONNX operator number change on the optimization: 379 → 173
2023-04-24 09:23:59,026 [INFO] iva.common.export.base_exporter: Generating a tensorfile with random tensor images. This may work well as a profiling tool, however, it may result in inaccurate results at inference. Please generate a tensorfile using the tlt-int8-tensorfile, or provide a custom directory of images for best performance.
Traceback (most recent call last):
File "</usr/local/lib/python3.6/dist-packages/iva/yolo_v3/scripts/export.py>", line 3, in
File "", line 30, in
File "", line 14, in
File "", line 302, in launch_export
File "", line 284, in run_export
File "", line 410, in export
File "", line 198, in get_calibrator
File "", line 309, in generate_tensor_file
File "", line 352, in generate_random_tensorfile
File "", line 54, in __init__
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 312, in __init__
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 148, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 98, in h5py.h5f.create
ValueError: Invalid file name (invalid file name)
Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

Is there another problem?

Can you double check the path of each file? The path should be a path inside the docker.

docker run -it --rm --gpus all -v .:/workspace nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

Then,
ls specs/experiment_spec_exp.json
ls yolo_v3/experiment_dir_retrain_qat3/weights/yolov3_resnet18_epoch_080.tlt


etc

Thank you for your answer! So, all the required files are here:
```
root@7ed655ac235e:/workspace# ls specs/
coco_config.json experiment_spec_Q.json experiment_spec_exp.json
experiment_spec.json experiment_spec_QAT.json
root@7ed655ac235e:/workspace# ls specs/experiment_spec_exp.json
specs/experiment_spec_exp.json
root@7ed655ac235e:/workspace# ls yolo_v3/experiment_dir_retrain_qat3/weights/yolov3_resnet18_epoch_080.tlt
yolo_v3/experiment_dir_retrain_qat3/weights/yolov3_resnet18_epoch_080.tlt
```
What I understood about the other files: yolo_v3/experiment_dir_final/resnet18_detector_qat2.etlt is the .etlt file that this command generates (which it does), and yolo_v3/experiment_dir_final/calibration_qat.json should be the generated calibration weights for INT8 inference; however, the command does not create it.

Did I misunderstand something? Because, for instance, when you run this command with detectnet, it does generate the calibration file.

Can you refer to the command in latest notebook? GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC

!tao yolo_v3 export -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/yolov3_resnet18_epoch_$EPOCH.tlt \
                    -k $KEY \
                    -o $USER_EXPERIMENT_DIR/export/yolov3_resnet18_epoch_$EPOCH.etlt \
                    -e $SPECS_DIR/yolo_v3_retrain_resnet18_tfrecord.txt \
                    --target_opset 12 \
                    --gen_ds_config

Okay, thank you for this resource, it really helped and I understood my problem! I was following an older notebook where the cal.bin file could be generated during export (with detectnet at least). But now, to generate the cal.bin file, you need to use tao-deploy.

So the process is: first train with TAO,
then export using your command:

!tao yolo_v3 export -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/yolov3_resnet18_epoch_$EPOCH.tlt \
                    -k $KEY \
                    -o $USER_EXPERIMENT_DIR/export/yolov3_resnet18_epoch_$EPOCH.etlt \
                    -e $SPECS_DIR/yolo_v3_retrain_resnet18_tfrecord.txt \
                    --target_opset 12 \
                    --gen_ds_config

then deploy to generate the engine file and the cal.bin file:

!tao-deploy yolo_v3 gen_trt_engine -m $USER_EXPERIMENT_DIR/export/yolov3_resnet18_epoch_$EPOCH.etlt \
                                   -k $KEY \
                                   -e $SPECS_DIR/yolo_v3_retrain_resnet18_tfrecord.txt \
                                   --cal_image_dir $DATA_DOWNLOAD_DIR/testing/image_2 \
                                   --data_type int8 \
                                   --batch_size 16 \
                                   --min_batch_size 1 \
                                   --opt_batch_size 8 \
                                   --max_batch_size 16 \
                                   --batches 10 \
                                   --cal_cache_file $USER_EXPERIMENT_DIR/export/cal.bin  \
                                   --cal_data_file $USER_EXPERIMENT_DIR/export/cal.tensorfile \
                                   --engine_file $USER_EXPERIMENT_DIR/export/trt.engine.int8
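
For reference, the resulting cal.bin (or the pre-built engine) can then be wired into a DeepStream nvinfer configuration along these lines. This is only a sketch: the property names come from the standard nvinfer config-file format, and the file paths are hypothetical stand-ins for wherever the exported artifacts end up:

```
[property]
# 1 = INT8 mode; requires the calibration cache produced by tao-deploy above
network-mode=1
int8-calib-file=export/cal.bin
# Either deploy the pre-built engine directly...
model-engine-file=export/trt.engine.int8
# ...or let DeepStream rebuild it from the encoded model and key
tlt-encoded-model=export/yolov3_resnet18_epoch_080.etlt
tlt-model-key=tlt_encode
```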

However, it doesn't work for me if there's enable_qat=true in the spec file (-e), but frankly, now I'm just happy it works, haha.

Thank you very much for your help !

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.