Yeah, I did that. Here is the log:
======================
=== TAO Toolkit Deploy ===
NVIDIA Release 4.0.0-Deploy (build 47705558)
TAO Toolkit Version 4.0.0
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
INFO:root:The provided .etlt file is in ONNX format.
[07/21/2023-12:32:05] [TRT] [I] [MemUsageChange] Init CUDA: CPU +318, GPU +0, now: CPU 351, GPU 992 (MiB)
[07/21/2023-12:32:06] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +443, GPU +116, now: CPU 848, GPU 1108 (MiB)
[07/21/2023-12:32:06] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in CUDA C++ Programming Guide
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:Parsing ONNX model
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:List inputs:
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:Input 0 -> Input.
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:[3, 1280, 1280].
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:0.
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
[07/21/2023-12:32:06] [TRT] [I] No importer registered for op: BatchedNMSDynamic_TRT. Attempting to import as plugin.
[07/21/2023-12:32:06] [TRT] [I] Searching for plugin: BatchedNMSDynamic_TRT, plugin_version: 1, plugin_namespace:
[07/21/2023-12:32:06] [TRT] [W] parsers/onnx/builtin_op_importers.cpp:5225: Attribute caffeSemantics not found in plugin node! Ensure that the plugin creator has a default value defined or the engine may fail to build.
[07/21/2023-12:32:06] [TRT] [I] Successfully created plugin: BatchedNMSDynamic_TRT
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:Network Description
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:Input 'Input' with shape (-1, 3, 1280, 1280) and dtype DataType.FLOAT
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:Output 'BatchedNMS' with shape (-1, 1) and dtype DataType.INT32
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:Output 'BatchedNMS_1' with shape (-1, 200, 4) and dtype DataType.FLOAT
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:Output 'BatchedNMS_2' with shape (-1, 200) and dtype DataType.FLOAT
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:Output 'BatchedNMS_3' with shape (-1, 200) and dtype DataType.FLOAT
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:dynamic batch size handling
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:Calibrating using ImageBatcher
INFO:nvidia_tao_deploy.cv.yolo_v3.engine_builder:[1, 3, 1280, 1280]
[07/21/2023-12:32:07] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +854, GPU +362, now: CPU 1763, GPU 1490 (MiB)
[07/21/2023-12:32:07] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +126, GPU +58, now: CPU 1889, GPU 1548 (MiB)
[07/21/2023-12:32:07] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[07/21/2023-12:32:08] [TRT] [I] Total Activation Memory: 2956434432
[07/21/2023-12:32:08] [TRT] [I] Detected 1 inputs and 4 output network tensors.
[07/21/2023-12:32:08] [TRT] [I] Total Host Persistent Memory: 112528
[07/21/2023-12:32:08] [TRT] [I] Total Device Persistent Memory: 0
[07/21/2023-12:32:08] [TRT] [I] Total Scratch Memory: 16147968
[07/21/2023-12:32:08] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 8 MiB, GPU 60 MiB
[07/21/2023-12:32:08] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 152 steps to complete.
[07/21/2023-12:32:08] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 3.90092ms to assign 14 blocks to 152 nodes requiring 85787648 bytes.
[07/21/2023-12:32:08] [TRT] [I] Total Activation Memory: 85787648
[07/21/2023-12:32:08] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2467, GPU 1860 (MiB)
[07/21/2023-12:32:08] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2468, GPU 1870 (MiB)
[07/21/2023-12:32:08] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 2467, GPU 1846 (MiB)
[07/21/2023-12:32:08] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2467, GPU 1854 (MiB)
[07/21/2023-12:32:08] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +81, now: CPU 0, GPU 141 (MiB)
[07/21/2023-12:32:08] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in CUDA C++ Programming Guide
[07/21/2023-12:32:08] [TRT] [I] Starting Calibration.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 1 / 7139
[07/21/2023-12:32:08] [TRT] [I] Calibrated batch 0 in 0.0707083 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 2 / 7139
[07/21/2023-12:32:08] [TRT] [I] Calibrated batch 1 in 0.0706296 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 3 / 7139
[07/21/2023-12:32:08] [TRT] [I] Calibrated batch 2 in 0.0706181 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 4 / 7139
[07/21/2023-12:32:08] [TRT] [I] Calibrated batch 3 in 0.0702139 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 5 / 7139
[07/21/2023-12:32:08] [TRT] [I] Calibrated batch 4 in 0.0701423 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 6 / 7139
[07/21/2023-12:32:08] [TRT] [I] Calibrated batch 5 in 0.0710811 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 7 / 7139
[07/21/2023-12:32:08] [TRT] [I] Calibrated batch 6 in 0.0711559 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 8 / 7139
[07/21/2023-12:32:08] [TRT] [I] Calibrated batch 7 in 0.0705914 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 9 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 8 in 0.0705384 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 10 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 9 in 0.0702 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 11 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 10 in 0.0703424 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 12 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 11 in 0.0700974 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 13 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 12 in 0.0711241 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 14 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 13 in 0.0702954 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 15 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 14 in 0.0720497 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 16 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 15 in 0.0711025 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 17 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 16 in 0.0714033 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 18 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 17 in 0.0718931 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 19 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 18 in 0.0700664 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 20 / 7139
[07/21/2023-12:32:09] [TRT] [I] Calibrated batch 19 in 0.0716921 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 21 / 7139
[07/21/2023-12:32:10] [TRT] [I] Calibrated batch 20 in 0.0725608 seconds.
[…]
INFO:nvidia_tao_deploy.engine.calibrator:Calibrating image 7139 / 7139
[07/21/2023-12:42:25] [TRT] [I] Calibrated batch 7138 in 0.0701983 seconds.
INFO:nvidia_tao_deploy.engine.calibrator:Finished calibration batches
[07/21/2023-12:42:31] [TRT] [I] Post Processing Calibration data in 6.0277 seconds.
[07/21/2023-12:42:31] [TRT] [I] Calibration completed in 624.323 seconds.
[07/21/2023-12:42:31] [TRT] [I] Writing Calibration Cache for calibrator: TRT-8501-EntropyCalibration2
INFO:nvidia_tao_deploy.engine.calibrator:Writing calibration cache data to: /workspace/yolo_v3/export/cal.bin
[07/21/2023-12:42:31] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 401) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[07/21/2023-12:42:31] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 405) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[07/21/2023-12:42:31] [TRT] [W] Missing scale and zero-point for tensor BatchedNMS, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[07/21/2023-12:42:31] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 2898, GPU 1786 (MiB)
[07/21/2023-12:42:31] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2898, GPU 1794 (MiB)
[07/21/2023-12:42:31] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/21/2023-12:43:48] [TRT] [I] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.
[07/21/2023-12:47:50] [TRT] [I] Total Activation Memory: 2284631552
[07/21/2023-12:47:50] [TRT] [I] Detected 1 inputs and 4 output network tensors.
[07/21/2023-12:47:50] [TRT] [I] Total Host Persistent Memory: 140688
[07/21/2023-12:47:50] [TRT] [I] Total Device Persistent Memory: 0
[07/21/2023-12:47:50] [TRT] [I] Total Scratch Memory: 12922368
[07/21/2023-12:47:50] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 20 MiB, GPU 1113 MiB
[07/21/2023-12:47:50] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 70 steps to complete.
[07/21/2023-12:47:50] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 1.25284ms to assign 7 blocks to 70 nodes requiring 27386368 bytes.
[07/21/2023-12:47:50] [TRT] [I] Total Activation Memory: 27386368
[07/21/2023-12:47:51] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2938, GPU 1840 (MiB)
[07/21/2023-12:47:51] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2938, GPU 1848 (MiB)
[07/21/2023-12:47:51] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[07/21/2023-12:47:51] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[07/21/2023-12:47:51] [TRT] [W] Check verbose logs for the list of affected weights.
[07/21/2023-12:47:51] [TRT] [W] - 49 weights are affected by this issue: Detected subnormal FP16 values.
[07/21/2023-12:47:51] [TRT] [W] - 10 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[07/21/2023-12:47:51] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +14, GPU +21, now: CPU 14, GPU 21 (MiB)
INFO:root:Export finished successfully.
2023-07-21 12:47:51,621 [INFO] nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto: Sending telemetry data.
2023-07-21 12:47:52,199 [WARNING] nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto: Telemetry data couldn't be sent, but the command ran successfully.
2023-07-21 12:47:52,199 [WARNING] nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto: [Error]: __init__() missing 4 required positional arguments: 'code', 'msg', 'hdrs', and 'fp'
2023-07-21 12:47:52,199 [INFO] nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto: Execution status: PASS