Wrong inference results with etlt or engine, Faster RCNN in TLT 3.0 batch size > 1

I’m having a problem with TLT 3.0 (not TAO) running inside Docker: Faster-RCNN inference with the exported etlt doesn’t work correctly when I increase the batch size. With batch size > 1, only the results for the 1st image are correct; the results for all other images in the batch are incorrect.
This happens only after exporting to etlt (and consequently with the engine file created from it, or with the engine created during export); inference using the model from the tlt files works well.
Inferences with TLT 2.0 from my previous trainings with Faster-RCNN worked with any batch size.
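
The symptom described above can be checked independently of the tool by running each image once on its own and once as part of a batch, then comparing the detections per image. The following is only a minimal sketch: `infer` is a hypothetical callable (not part of TLT) that takes a list of images and returns, for each image, a list of `(label, score, x1, y1, x2, y2)` tuples.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical detection layout: (label, score, x1, y1, x2, y2)
Detection = Tuple[str, float, float, float, float, float]


def detections_match(a: Sequence[Detection], b: Sequence[Detection],
                     tol: float = 1e-2) -> bool:
    """Return True if both detection lists agree within a numeric tolerance."""
    if len(a) != len(b):
        return False
    # Sort so that the comparison does not depend on output order.
    for da, db in zip(sorted(a), sorted(b)):
        if da[0] != db[0]:
            return False
        if any(abs(x - y) > tol for x, y in zip(da[1:], db[1:])):
            return False
    return True


def find_batch_mismatches(infer: Callable[[list], List[List[Detection]]],
                          images: Sequence, tol: float = 1e-2) -> List[int]:
    """Return indices of images whose batched result differs from the
    result obtained when the same image is run alone (batch size 1)."""
    batched = infer(list(images))
    mismatched = []
    for i, image in enumerate(images):
        single = infer([image])[0]
        if not detections_match(single, batched[i], tol):
            mismatched.append(i)
    return mismatched
```

With the behaviour reported here, such a check would be expected to flag every index except 0 when run against the engine, and none when run against the tlt model.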

Please provide the following information when requesting support.

• Hardware: Quadro P5000 (fp32) nvidia-smi.txt (1.7 KB)

• Network Type: Faster_rcnn
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): TLT 3.0 (docker from nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3)
• Training spec file default_spec_resnet18_retrain_spec.txt (5.8 KB)
• How to reproduce the issue ?
Jupyter Notebook running at Docker TLT 3.0
Exporting etlt:
!faster_rcnn export --gpu_index $GPU_INDEX -m $USER_EXPERIMENT_DIR/2-retrained_pruned/frcnn_pruned_resnet18_retrain.epoch12.tlt \
    -o $USER_EXPERIMENT_DIR/2-retrained_pruned/frcnn_pruned_resnet18_retrain.fp32.etlt \
    -e $SPECS_DIR/default_spec_resnet18_retrain_spec.txt \
    -k $KEY \
    --data_type fp32 \
    --engine_file $USER_EXPERIMENT_DIR/2-retrained_pruned/frcnn_pruned_resnet18_retrain.fp32.trt-tlt.engine
log_tlt_export_command.txt (19.7 KB)

Inference:
!faster_rcnn inference --gpu_index $GPU_INDEX -e $SPECS_DIR/default_spec_resnet18_retrain_spec.txt
1st image:
538_2
2nd image:
538_1

How did you generate the trt engine?
In the TAO v3.21.08 docker, I cannot reproduce your result. I set batch_size to 2 in the spec file.

tao-converter -k nvidia_tlt -d 3,384,1248 -o NMS -e trt_fp16_m2.engine -t fp16 -m 2 frcnn_kitti_efficientnet_b1.epoch3.etlt

I’ve tried both ways: directly with the export command, and with the converter after exporting the ETLT:

Exporter:
!faster_rcnn export --gpu_index $GPU_INDEX -m $USER_EXPERIMENT_DIR/2-retrained_pruned/frcnn_pruned_resnet18_retrain.epoch12.tlt \
    -o $USER_EXPERIMENT_DIR/4-retrained_pruned_qat/frcnn_pruned_resnet18_retrain.fp32.etlt \
    -e $SPECS_DIR/default_spec_resnet18_retrain_spec.txt \
    -k $KEY \
    --data_type fp32 \
    --engine_file $USER_EXPERIMENT_DIR/2-retrained_pruned/frcnn_pruned_resnet18_retrain.fp32.trt-tlt.engine

Converter:
!CUDA_VISIBLE_DEVICES=$GPU_INDEX tlt-converter -k $KEY \
    -d 3,480,640 \
    -o NMS \
    -e $USER_EXPERIMENT_DIR/2-retrained_pruned/frcnn_resnet18_train2_detector_fp32-batch4.trt-tlt.engine \
    -m 4 \
    -t fp32 \
    -i nchw \
    $USER_EXPERIMENT_DIR/2-retrained_pruned/frcnn_pruned_resnet18_retrain.fp32.etlt
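
For anyone scripting the conversion instead of typing it in the notebook, the converter call above can be assembled as an argument list. This is only a sketch: it uses exactly the flags shown in this thread (`-k`, `-d`, `-o`, `-e`, `-m`, `-t`, `-i`), and the function name and the `model.etlt` / `model.engine` paths are hypothetical placeholders.

```python
import shlex
from typing import List, Sequence


def build_converter_cmd(etlt_path: str, engine_path: str, key: str,
                        dims: Sequence[int], max_batch: int,
                        data_type: str = "fp32", output_node: str = "NMS",
                        input_order: str = "nchw",
                        converter: str = "tlt-converter") -> List[str]:
    """Assemble the converter command as an argument list (no shell quoting issues)."""
    return [
        converter,
        "-k", key,
        "-d", ",".join(str(d) for d in dims),   # input dims, e.g. 3,480,640
        "-o", output_node,                      # output node name
        "-e", engine_path,                      # engine file to write
        "-m", str(max_batch),                   # maximum batch size of the engine
        "-t", data_type,
        "-i", input_order,
        etlt_path,                              # exported etlt model (last argument)
    ]


# Placeholder key and paths, for illustration only.
cmd = build_converter_cmd("model.etlt", "model.engine", "MY_KEY",
                          dims=(3, 480, 640), max_batch=4)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; setting `converter="tao-converter"` gives the equivalent call for the TAO tool mentioned earlier in the thread.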

After setting the batch size to 2 in the spec file, did you comment out the “model” parameter in the inference section and uncomment the “trt_engine” section and its parameters?
If this is not done, the inference model is created directly from the TLT file, and that works well for me too.

After generating the TensorRT engine, I set it in the “trt_engine” section, so the inference is using this TensorRT engine.

Could you try with Resnet18 as backbone? Just to be sure.
(not directly related) May I export/create the etlt file on the latest TAO docker using the tlt model files (those generated during training) from TLT 3.0?

Yes, you can.

Well, I retrained my model with the latest TAO Toolkit (v3.21.08), and now inference in the Jupyter Notebook using the engine file with batch size > 1 does work.

It seems that inferences using ETLT (before conversion) are no longer supported on TAO.

But I still have a problem when I convert the ETLT with either of the converters for TensorRT 7.1, [cuda102-cudnn80-trt71] / [cuda110-cudnn80-trt71]. TensorRT 7.1 is the version installed on my training/testing PC (and on my Jetson AGX Xavier with JetPack 4.5.1). My Python inference script works inside the docker used by TAO 3.21.08, but not outside it (there I get detections like those in the images from the initial post for all but the first image of the batch).

So I assume that either there is a problem with the converter for TensorRT 7.1, or models trained with TensorRT 7.2 (from TLT 3.0 or TAO 3.0) don’t work on TensorRT 7.1.

If it’s supposed to work on TensorRT 7.1, could you please test whether inference from your training with batch size > 1 works in an environment with TensorRT 7.1?

If you run your inference script on the Jetson AGX Xavier, please copy the etlt file to the Xavier. Then, on the Xavier, download the Jetson version of tao-converter and generate the TRT engine there.

I’ve updated my Jetson’s JetPack to version 4.6 and the training/testing PC’s TensorRT to v8.0.3.4, and I still have the same problem (using their corresponding tao-converters).
I’ll try DeepStream (which I have never used before) to run inference with batch size greater than 1 on the Jetson with its JetPack’s TensorRT v8.0.1 and see if it works.

Hi,
For TAO 3.0 docker, there is no issue.
For TLT 2.0 docker, there is no NMS plugin, so there is no problem; it works for batch size > 1.
For TLT 3.0 docker, it does not work for batch size > 1. Please rebuild a new TensorRT OSS plugin.

You can build it from the TensorRT OSS 21.04 branch; that works for TLT 3.0.

For Jetson devices, you need to build a new TensorRT OSS plugin as well.
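
Before swapping in a rebuilt plugin, it can help to see which plugin libraries are currently installed so they can be backed up first. The sketch below is only an illustration: it assumes the plugin is installed as `libnvinfer_plugin.so*` and that it lives in one of the usual Ubuntu multiarch library directories, which may not match every setup (the paths and the function name are assumptions, not something stated in this thread).

```python
import glob
import os
from typing import Iterable, List

# Assumed default locations (x86 PC and Jetson); adjust for your system.
DEFAULT_LIB_DIRS = (
    "/usr/lib/x86_64-linux-gnu",
    "/usr/lib/aarch64-linux-gnu",
)


def find_plugin_libs(lib_dirs: Iterable[str] = DEFAULT_LIB_DIRS) -> List[str]:
    """List installed TensorRT plugin libraries found in the given directories."""
    found: List[str] = []
    for directory in lib_dirs:
        pattern = os.path.join(directory, "libnvinfer_plugin.so*")
        found.extend(sorted(glob.glob(pattern)))
    return found


for path in find_plugin_libs():
    # realpath shows which versioned file each symlink points to.
    print(path, "->", os.path.realpath(path))
```

The printed list shows the files to back up before copying the rebuilt library over the versioned one.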