TLT Converter Fails

a428tm · December 14, 2020, 1:26pm

Hi,

I am trying to use tlt-converter to convert the model using QAT Workflow specified in DetectNet_v2 example of this doc - https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/overview.html

Here is what I used and am using -
For training, I used
Cloud Environment: Google Cloud and deployed this container (Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC)
Cloud Hardware Setup: NVIDIA GPU (I believe it is T4). 64GB RAM. 1TB Storage

Local Setup: Xavier NX
Clock Mode: Mode 15W 6Core
I was able to go through the entire steps of 11.QAT workflow. After that, I grabbed the exported model and moved it to Xavier NX to run the tlt-converter, I got this error (the error happened even on just regular int8 mode using Step 10) -

[INFO] Reading Calibration Cache for calibrator: EntropyCalibration2
[INFO] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales.
[INFO] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache.
[WARNING] Missing dynamic range for tensor output_bbox/BiasAdd, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing dynamic range for tensor output_cov/BiasAdd, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing dynamic range for tensor output_cov/Sigmoid, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[INFO] 
[INFO] --------------- Layers running on DLA: 
[INFO] 
[INFO] --------------- Layers running on GPU: 
[INFO] conv1/convolution + activation_1/Relu6, block_1a_conv_1/convolution + block_1a_relu_1/Relu6, block_1a_conv_2/convolution, block_1a_conv_shortcut/convolution + add_1/add + block_1a_relu/Relu6, block_1b_conv_1/convolution + block_1b_relu_1/Relu6, block_1b_conv_2/convolution + add_2/add + block_1b_relu/Relu6, block_2a_conv_1/convolution + block_2a_relu_1/Relu6, block_2a_conv_2/convolution, block_2a_conv_shortcut/convolution + add_3/add + block_2a_relu/Relu6, block_2b_conv_1/convolution + block_2b_relu_1/Relu6, block_2b_conv_2/convolution + add_4/add + block_2b_relu/Relu6, block_3a_conv_1/convolution + block_3a_relu_1/Relu6, block_3a_conv_2/convolution, block_3a_conv_shortcut/convolution + add_5/add + block_3a_relu/Relu6, block_3b_conv_1/convolution + block_3b_relu_1/Relu6, block_3b_conv_2/convolution + add_6/add + block_3b_relu/Relu6, block_4a_conv_1/convolution + block_4a_relu_1/Relu6, block_4a_conv_2/convolution, block_4a_conv_shortcut/convolution + add_7/add + block_4a_relu/Relu6, block_4b_conv_1/convolution + block_4b_relu_1/Relu6, block_4b_conv_2/convolution + add_8/add + block_4b_relu/Relu6, output_cov/convolution, output_cov/Sigmoid, output_bbox/convolution, 
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
Killed

The same error popped up when I was doing tlt-converter on Step 9B (on Google Cloud) shown in the Jupyter notebook, but it worked without a problem. However on NX, I couldn’t really figure out how to make it work or what is causing this issue.

Would appreciate any guidance/help.

Thanks

Morganh · December 14, 2020, 2:50pm

The tlt-converter runs into “KIlled”. I am afraid it is due to lack of workspace memory.
Please set the “-w” option, for example, -w 1000000000
You can also add verbose option in the command line to check more log.

a428tm · December 14, 2020, 3:37pm

Thanks for the quick response @Morganh

based on what I saw on ./tlt-converter -h I turned on v flag to get verbose, but that doesn’t seem to be very helpful

I ended up running the script with -w and the command was this -
sudo ./tlt-converter -k API_KEY -d 3,720,1280 -o output_bbox/BiasAdd,output_cov/Sigmoid -i nchw -m 64 -t int8 -e ~/Downloads/resnet18_detector.trt -c ~/Downloads/calibration_qat.bin ~/Downloads/resnet18_detector_qat.etlt -w 4294967296

I figured that was in bytes so I thought i was using 4GB. When I dod that. Still got some error but figured out that it was because I was using -m 64 (which was fine for GCP but not for NX). Updated that to 16 and it worked.

As always, thank you!

Topic		Replies	Views
TLT-converter for HeartRateNet model error TAO Toolkit	20	888	September 27, 2021
Tlt-converter throws error 'std::invalid_argument' TAO Toolkit	9	1458	October 12, 2021
Classification TF_2 QAT - Calibration files? TAO Toolkit	9	378	June 4, 2023
Error at exporting to TRT engine in TLT TAO Toolkit	7	929	August 24, 2021
Invalid decryption. Unable to open file (file signature not found). The key used to load the model is incorrect TAO Toolkit	3	667	October 12, 2021
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory) TAO Toolkit tensorrt	2	1055	October 12, 2021
TLT Converter UffParser: Unsupported number of graph 0 TAO Toolkit	4	2494	October 12, 2021
Cannot run model exported from TLT on Jetson's DLA TAO Toolkit tensorrt	7	444	October 12, 2021
Error while converting .etlt model to .trt model TAO Toolkit tensorrt , tao	17	720	July 7, 2023
Error while coverting the mode TAO Toolkit	6	340	September 5, 2021

TLT Converter Fails

Related topics