TLT YOLOv3 INT8 cannot detect anything

Hi,
I’m using NVIDIA TLT to train my own YOLOv3 model and can successfully get good results in FP16 mode, but when I export in INT8 mode with tlt-export and use the .etlt and cal.bin files in DeepStream, it cannot detect anything. I attach my cal.bin file, my DeepStream config, and images detected in FP16 and INT8 mode. I also tried to generate an INT8 engine with tlt-converter, but no luck.
cal.bin.txt (9.0 KB)
dstest3_pgie_config.txt (1.3 KB)
labels.txt (28 Bytes)
yolo_retrain_resnet18_kitti.txt (2.0 KB)

Please set

cluster-mode=3

and retry.
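For context, cluster-mode selects which clustering algorithm nvinfer applies to the raw detections; the key goes in the pgie config file. A minimal sketch of where the key sits (the surrounding key names follow DeepStream 5.0's nvinfer config conventions; the other values shown are placeholders, not the poster's actual config):

```ini
[property]
# ... existing keys (model paths, tlt-model-key, etc.) stay as-is ...
# clustering mode suggested for TLT YOLOv3 in this reply
cluster-mode=3
```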

I tried, but still nothing detected.

I suggest using tlt-infer inside the docker first to verify the INT8 engine and check whether its output is good.
You can refer to the commands inside the Jupyter notebook.

!tlt-converter -k $KEY \
    -d 3,384,1248 \
    -o BatchedNMS \
    -c $USER_EXPERIMENT_DIR/export/cal.bin \
    -e $USER_EXPERIMENT_DIR/export/trt.engine \
    -b 8 \
    -m 1 \
    -t int8 \
    -i nchw \
    $USER_EXPERIMENT_DIR/export/yolo_resnet18_epoch_$EPOCH.etlt

!tlt-infer yolo --trt -p $USER_EXPERIMENT_DIR/export/trt.engine \
    -e $SPECS_DIR/yolo_retrain_resnet18_kitti.txt \
    -i /workspace/examples/yolo/test_samples \
    -o $USER_EXPERIMENT_DIR/yolo_infer_images \
    -t 0.6

Hi, it’s the same, nothing is detected. Here is the command used for tlt-export:

!tlt-export yolo -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/yolo_resnet18_epoch_150.tlt \
    -o $USER_EXPERIMENT_DIR/export/yolo_resnet18_int8__epoch_150.etlt \
    -e $USER_EXPERIMENT_DIR/yolo_retrain_resnet18_kitti.txt \
    -k $KEY \
    --cal_image_dir $USER_EXPERIMENT_DIR/data/testing/image_2 \
    --data_type int8 \
    --batch_size 1 \
    --batches 10 \
    --cal_cache_file $USER_EXPERIMENT_DIR/export/cal.bin \
    --cal_data_file $USER_EXPERIMENT_DIR/export/cal.tensorfile

and the tlt-export log:

Using TensorFlow backend.
2020-07-31 07:39:40,628 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/yolo/yolo_retrain_resnet18_kitti.txt
2020-07-31 07:39:44,234 [INFO] /usr/local/lib/python2.7/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/yolo/yolo_retrain_resnet18_kitti.txt
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
Warning: No conversion function registered for layer: BatchedNMS_TRT yet.
Converting BatchedNMS as custom op: BatchedNMS_TRT
Warning: No conversion function registered for layer: ResizeNearest_TRT yet.
Converting upsample1/ResizeNearestNeighbor as custom op: ResizeNearest_TRT
Warning: No conversion function registered for layer: ResizeNearest_TRT yet.
Converting upsample0/ResizeNearestNeighbor as custom op: ResizeNearest_TRT
Warning: No conversion function registered for layer: BatchTilePlugin_TRT yet.
Converting FirstDimTile_2 as custom op: BatchTilePlugin_TRT
Warning: No conversion function registered for layer: BatchTilePlugin_TRT yet.
Converting FirstDimTile_1 as custom op: BatchTilePlugin_TRT
Warning: No conversion function registered for layer: BatchTilePlugin_TRT yet.
Converting FirstDimTile_0 as custom op: BatchTilePlugin_TRT
DEBUG [/usr/lib/python2.7/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['BatchedNMS'] as outputs
2020-07-31 07:40:01,000 [WARNING] modulus.export._tensorrt: Calibration file /workspace/tlt-experiments/yolo/export/cal.bin exists but is being ignored.
[TensorRT] INFO: Detected 1 inputs and 4 output network tensors.
[TensorRT] WARNING: Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT] INFO: Starting Calibration with batch size 1.
DEPRECATED: This variant of get_batch is deprecated. Please use the single argument variant described in the documentation instead.
[TensorRT] INFO: Calibrated batch 0 in 0.123853 seconds.
[TensorRT] INFO: Calibrated batch 1 in 0.111234 seconds.
[TensorRT] INFO: Calibrated batch 2 in 0.110646 seconds.
[TensorRT] INFO: Calibrated batch 3 in 0.110725 seconds.
[TensorRT] INFO: Calibrated batch 4 in 0.11233 seconds.
[TensorRT] INFO: Calibrated batch 5 in 0.111406 seconds.
[TensorRT] INFO: Calibrated batch 6 in 0.114801 seconds.
[TensorRT] INFO: Calibrated batch 7 in 0.121879 seconds.
[TensorRT] INFO: Calibrated batch 8 in 0.123058 seconds.
[TensorRT] INFO: Calibrated batch 9 in 0.123246 seconds.
[TensorRT] WARNING: Tensor BatchedNMS is uniformly zero; network calibration failed.
[TensorRT] WARNING: Tensor BatchedNMS_1 is uniformly zero; network calibration failed.
[TensorRT] WARNING: Tensor BatchedNMS_2 is uniformly zero; network calibration failed.
[TensorRT] INFO: Post Processing Calibration data in 5.69181 seconds.
[TensorRT] INFO: Calibration completed in 34.6949 seconds.
2020-07-31 07:40:35,756 [WARNING] modulus.export._tensorrt: Calibration file /workspace/tlt-experiments/yolo/export/cal.bin exists but is being ignored.
[TensorRT] INFO: Writing Calibration Cache for calibrator: TRT-7000-EntropyCalibration2
2020-07-31 07:40:35,757 [INFO] modulus.export._tensorrt: Saving calibration cache (size 9237) to /workspace/tlt-experiments/yolo/export/cal.bin
[TensorRT] WARNING: Rejecting int8 implementation of layer BatchedNMS due to missing int8 scales, will choose a non-int8 implementation.
[TensorRT] INFO: Detected 1 inputs and 4 output network tensors.

Please double check whether tlt-infer gives good output (check the images in $USER_EXPERIMENT_DIR/yolo_infer_images).
The above is just the tlt-export log.

Furthermore, if you find that tlt-infer still does not give good output, there is probably something wrong with cal.bin or elsewhere. Please set --batches so that calibration covers at least 10% of your training dataset and generate cal.bin again.

Hi, how many images in $USER_EXPERIMENT_DIR/data/testing/image_2 are suitable for calibration? And those images should have the same size as the training images, right?
My training dataset is about 7000 images, so --batches should be 700, right?

Yes, if your training dataset folder is xxx, with 7000 images in total, then set

--cal_image_dir xxx
--batch_size 1
--batches 700 (or larger)
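As a sanity check on the numbers above, the rule of thumb "batches × batch_size should cover at least 10% of the training set" can be computed with a small script (a sketch; the 7000-image figure comes from this thread, the helper name is illustrative):

```python
import math

def min_calibration_batches(num_train_images, batch_size, fraction=0.1):
    """Smallest --batches value so that batches * batch_size
    covers at least `fraction` of the training set."""
    return math.ceil(num_train_images * fraction / batch_size)

# 7000 training images, batch_size 1 -> 700 batches, matching the advice above
print(min_calibration_batches(7000, 1))   # 700
# With a larger calibration batch size, fewer batches are needed
print(min_calibration_batches(7000, 8))   # 88
```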

I tried many times, even with --batches 7000, but the result is the same. However, I see that the cal.bin generated by tlt-export and the calibration file in the objectDetector_Yolo sample in DeepStream are in different formats. Is that OK, or is something wrong?

cal.bin generated by tlt-export:

TRT-7000-EntropyCalibration2
Input: 3c063c76
conv1/convolution: 3d2cbbb4
bn_conv1/batchnorm/mul_1: 39fd4957
bn_conv1/batchnorm/add_1: 3cb24cfc
activation_3/Relu: 3cb24cfc
block_1a_conv_1/convolution: 3d0cb4e6
block_1a_bn_1/batchnorm/mul_1: 3cf1a6df
block_1a_bn_1/batchnorm/add_1: 3c40bc7c
block_1a_relu_1/Relu: 3c409159
block_1a_conv_2/convolution: 3cb3bacd
block_1a_bn_2/batchnorm/mul_1: 3d2981b5
block_1a_bn_2/batchnorm/add_1: 3c09a7ae
block_1a_conv_shortcut/convolution: 3c5c6a51
block_1a_bn_shortcut/batchnorm/mul_1: 3cbd0059
block_1a_bn_shortcut/batchnorm/add_1: 3c1c4c5c
add_17/add: 3c6f6fd9

calibrator file in Deepstream yolo example:

TRT-7000-EntropyCalibration2
data: 3c008912
(Unnamed Layer* 0) [Convolution]_output: 3c575e9d
(Unnamed Layer* 1) [Scale]_output: 3da4d54d
(Unnamed Layer* 2) [Activation]_output: 3d5f264f
(Unnamed Layer* 3) [Convolution]_output: 3e0e6205
(Unnamed Layer* 4) [Scale]_output: 3dad4779
(Unnamed Layer* 5) [Activation]_output: 3d8afc75
(Unnamed Layer* 6) [Convolution]_output: 3d9c79b6
(Unnamed Layer* 7) [Scale]_output: 3d5d1828
(Unnamed Layer* 8) [Activation]_output: 3d2922e4
(Unnamed Layer* 9) [Convolution]_output: 3dc13536
(Unnamed Layer* 10) [Scale]_output: 3e113077
(Unnamed Layer* 11) [Activation]_output: 3d3ed7c6
(Unnamed Layer* 12) [ElementWise]_output: 3d9cbb9b
(Unnamed Layer* 13) [Convolution]_output: 3e62a78d
(Unnamed Layer* 14) [Scale]_output: 3d8b0375
(Unnamed Layer* 15) [Activation]_output: 3d7152cd
(Unnamed Layer* 16) [Convolution]_output: 3d1d0528
(Unnamed Layer* 17) [Scale]_output: 3d36d309
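On the format question: both caches share the same TRT-7000-EntropyCalibration2 header, and each following line maps a tensor name to a calibration scale written as the hex bit pattern of an IEEE 754 float32; only the tensor names differ (TLT keeps the original layer names, while the DeepStream YOLO sample builds the network via the API, which yields "Unnamed Layer" names). A small sketch to decode the scales, assuming that cache layout (the parsing is an inference from the files shown here, not a documented TensorRT API):

```python
import struct

def decode_scale(hex_str):
    """Interpret a calibration-cache hex value as a big-endian IEEE 754 float32."""
    return struct.unpack("!f", bytes.fromhex(hex_str))[0]

def parse_cache(text):
    """Parse the 'name: hexscale' lines of a TensorRT calibration cache."""
    scales = {}
    for line in text.splitlines()[1:]:          # skip the TRT-7000-... header line
        if ":" in line:
            name, _, value = line.rpartition(":")  # names may contain ':'-free brackets
            scales[name.strip()] = decode_scale(value.strip())
    return scales

# First entries of the cal.bin posted above
cache = """TRT-7000-EntropyCalibration2
Input: 3c063c76
conv1/convolution: 3d2cbbb4"""
print(parse_cache(cache)["Input"])   # ~0.00819
```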

There is no update from you for a period, assuming this is not an issue any more.
Hence we are closing this topic. If need further support, please open a new one.
Thanks

Could you try to use the default yolo_v2 Jupyter notebook and train on the KITTI dataset?
Just to narrow down the issue and see whether an int8 engine works with a different dataset.

Having the same issue here. My TLT YOLOv3 in FP16 works well with the deepstream-custom app from tlt_deepstream_apps.
However, when I supply the calibration file generated at export, the model still loads and compiles, but no detections are made.

An update about this.
The first scenario where I encountered this issue is as follows:
Train YoloV3 with Darknet53, no QAT in spec -> Prune -> Retrain with QAT in spec -> INT8 QAT export fail

I then did the following and was able to export the weights saved during QAT training
Train YoloV3 with MobileNetV2, with QAT in spec -> Prune -> Retrain with QAT in spec -> INT8 QAT export success

Could it be that either the Darknet53 backbone does not support QAT operations, or that QAT was not used pre-pruning?

Hi @roulbac,
How about running with default KITTI dataset? Will it run into the same issue you mentioned?
BTW, what about the flow below: does it succeed or fail?
Train YoloV3 with Darknet53, with QAT in spec -> Prune -> Retrain with QAT in spec -> INT8 QAT export

@Morganh I tested again, and training with QAT before and after pruning allows for successful QAT export

@roulbac,
Please create a new forum topic if there is further issue.
Yours is different from @phamngan150893.

@Morganh you’re right, I confused this post with the following

What happened is that I tried QAT export first and encountered the error in the post above.
Then I exported the same model, this time specifying the calibration images folder, without errors.
However, the model I then calibrated using that cache didn’t output any predictions, whereas the same model in FP16 did.

@roulbac
I suggest running the default yolo_v3 Jupyter notebook along with the KITTI dataset, to check whether the same issue occurs.