TLT Converter Fails

Hi,

I am trying to use tlt-converter to convert a model, following the QAT workflow in the DetectNet_v2 example from this doc: https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/overview.html

Here is my setup -
For training, I used:
Cloud Environment: Google Cloud and deployed this container (Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC)
Cloud Hardware Setup: NVIDIA GPU (I believe it is T4). 64GB RAM. 1TB Storage

Local Setup: Xavier NX
Clock Mode: Mode 15W 6Core
I was able to go through all of step 11 (QAT workflow). After that, I copied the exported model to the Xavier NX and ran tlt-converter, which produced the error below (the same error also happens in regular int8 mode using the model from step 10) -

[INFO] Reading Calibration Cache for calibrator: EntropyCalibration2
[INFO] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales.
[INFO] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache.
[WARNING] Missing dynamic range for tensor output_bbox/BiasAdd, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing dynamic range for tensor output_cov/BiasAdd, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing dynamic range for tensor output_cov/Sigmoid, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[INFO] 
[INFO] --------------- Layers running on DLA: 
[INFO] 
[INFO] --------------- Layers running on GPU: 
[INFO] conv1/convolution + activation_1/Relu6, block_1a_conv_1/convolution + block_1a_relu_1/Relu6, block_1a_conv_2/convolution, block_1a_conv_shortcut/convolution + add_1/add + block_1a_relu/Relu6, block_1b_conv_1/convolution + block_1b_relu_1/Relu6, block_1b_conv_2/convolution + add_2/add + block_1b_relu/Relu6, block_2a_conv_1/convolution + block_2a_relu_1/Relu6, block_2a_conv_2/convolution, block_2a_conv_shortcut/convolution + add_3/add + block_2a_relu/Relu6, block_2b_conv_1/convolution + block_2b_relu_1/Relu6, block_2b_conv_2/convolution + add_4/add + block_2b_relu/Relu6, block_3a_conv_1/convolution + block_3a_relu_1/Relu6, block_3a_conv_2/convolution, block_3a_conv_shortcut/convolution + add_5/add + block_3a_relu/Relu6, block_3b_conv_1/convolution + block_3b_relu_1/Relu6, block_3b_conv_2/convolution + add_6/add + block_3b_relu/Relu6, block_4a_conv_1/convolution + block_4a_relu_1/Relu6, block_4a_conv_2/convolution, block_4a_conv_shortcut/convolution + add_7/add + block_4a_relu/Relu6, block_4b_conv_1/convolution + block_4b_relu_1/Relu6, block_4b_conv_2/convolution + add_8/add + block_4b_relu/Relu6, output_cov/convolution, output_cov/Sigmoid, output_bbox/convolution, 
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
Killed

The same messages appeared when I ran tlt-converter in Step 9B (on Google Cloud) of the Jupyter notebook, but the conversion completed without a problem there. On the NX, however, I couldn’t figure out how to make it work or what is causing the issue.

Would appreciate any guidance/help.

Thanks

The tlt-converter run ends with “Killed”. I suspect this is due to insufficient workspace memory.
Please set the “-w” option, for example, -w 1000000000
You can also add the verbose option to the command line to get a more detailed log.
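
As a rough illustration of the sizing, and assuming -w takes a value in bytes (the help text is not quoted in this thread), the workspace size can be computed in the shell instead of typed by hand. The variable names below are illustrative only.

```shell
# Assumption: -w expects the workspace size in bytes.
# Illustrative variable: the desired workspace in GiB.
WORKSPACE_GIB=1

# 1 GiB = 1024 * 1024 * 1024 bytes.
WORKSPACE_BYTES=$((WORKSPACE_GIB * 1024 * 1024 * 1024))

# Print the flag as it would be appended to the tlt-converter command.
echo "-w $WORKSPACE_BYTES"   # prints: -w 1073741824
```

The example value of 1000000000 above is slightly less than 1 GiB under this interpretation.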

Thanks for the quick response @Morganh

Based on what I saw in ./tlt-converter -h, I turned on the -v flag to get verbose output, but that didn’t seem to be very helpful.

I ended up running the script with -w and the command was this -
sudo ./tlt-converter -k API_KEY -d 3,720,1280 -o output_bbox/BiasAdd,output_cov/Sigmoid -i nchw -m 64 -t int8 -e ~/Downloads/resnet18_detector.trt -c ~/Downloads/calibration_qat.bin ~/Downloads/resnet18_detector_qat.etlt -w 4294967296

I assumed the value was in bytes, so I thought I was requesting 4GB. When I did that, I still got an error, but figured out that it was because I was using -m 64 (which was fine on GCP but not on the NX). I changed that to 16 and it worked.
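
For anyone landing here later, the working invocation can be sketched as below. It is the command from this post with only the -m value changed from 64 to 16. The variable names are illustrative, API_KEY is a placeholder, and the snippet only prints the command (a dry run) rather than executing the converter.

```shell
# Workspace size passed to -w, assumed to be in bytes: 4 * 1024^3.
WORKSPACE_BYTES=$((4 * 1024 * 1024 * 1024))

# Max batch size: 64 worked on the cloud GPU, 16 worked on the Xavier NX.
MAX_BATCH=16

# Same flags and paths as the command above; API_KEY is a placeholder.
CMD="sudo ./tlt-converter -k API_KEY -d 3,720,1280 -o output_bbox/BiasAdd,output_cov/Sigmoid -i nchw -m $MAX_BATCH -t int8 -e ~/Downloads/resnet18_detector.trt -c ~/Downloads/calibration_qat.bin ~/Downloads/resnet18_detector_qat.etlt -w $WORKSPACE_BYTES"

# Dry run: show the command that would be executed on the device.
echo "$CMD"
```

Whether a batch size between 16 and 64 would also fit on the NX was not tested here.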

As always, thank you!