PeopleNet int8 shows small improvement over fp16

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: 5.1
• JetPack Version (valid for Jetson only):
• TensorRT Version: 7.2
• NVIDIA GPU Driver Version (valid for GPU only): 460.80
• Issue Type (questions, new requirements, bugs): Question
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the contents of the configuration files, the command line used, and other details for reproducing.)

I am running PeopleNet directly out of the 5.1-21.02-devel container using the included deepstream_app_source1_peoplenet.txt. On our test videos, the int8 quantized model yields only about a 10% fps increase over fp16 on a single stream. This seems low, but perhaps my expectations are incorrect - I was expecting a much bigger performance increase from int8. I have the following questions:

  1. Is this the expected performance increase for int8 over fp16 on a single stream? If not, how much should be expected?
  2. If this is the expected single-stream gain, can batching provide a bigger improvement? If so, how much?

I am including my pgie configuration for int8 below for diagnostic purposes. In the fp16 setting, I am using the default config_infer_primary_peoplenet.txt included in the container with the fp16 pruned model. For int8, I am using the config below with the quantized pruned model (the lines that differ from the fp16 config are sketched after the listing).

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
tlt-model-key=tlt_encode
tlt-encoded-model=../../../models/tlt_pretrained_models/peoplenet-int8/resnet34_peoplenet_pruned_int8.etlt
labelfile-path=../../../models/tlt_pretrained_models/peoplenet-int8/labels.txt
model-engine-file=../../../models/tlt_pretrained_models/peoplenet-int8/resnet34_peoplenet_pruned_int8.etlt_b1_gpu0_int8.engine
int8-calib-file=../../../models/tlt_pretrained_models/peoplenet-int8/resnet34_peoplenet_pruned_int8_gpu.txt
input-dims=3;544;960;0
uff-input-blob-name=input_1
batch-size=1
process-mode=1
model-color-format=0
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=1
num-detected-classes=3
cluster-mode=1
interval=0
gie-unique-id=1
output-blob-names=output_bbox/BiasAdd;output_cov/Sigmoid

[class-attrs-all]
pre-cluster-threshold=0.4
## Set eps=0.7 and minBoxes for cluster-mode=1(DBSCAN)
eps=0.7
minBoxes=1
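
For reference, the fp16 configuration should differ from the int8 one above only in the model files and network-mode. A minimal sketch of the lines that change (the fp16 file paths are assumptions based on the container's default layout; the engine name matches the one profiled later in this thread):

tlt-encoded-model=../../../models/tlt_pretrained_models/peoplenet/resnet34_peoplenet_pruned.etlt
model-engine-file=../../../models/tlt_pretrained_models/peoplenet/resnet34_peoplenet_pruned.etlt_b1_gpu0_fp16.engine
## no int8-calib-file is needed in fp16 mode
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2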

Hi @admayber,
Sorry for the delay!

What’s the GPU?

Could you use trtexec to profile the TensorRT engine generated by DeepStream, e.g. resnet34_peoplenet_pruned_int8.etlt_b1_gpu0_int8.engine?

Command:
$ trtexec --loadEngine=resnet34_peoplenet_pruned_int8.etlt_b1_gpu0_int8.engine

and share the log.
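
If per-layer numbers would help, trtexec can also dump a profile. A sketch, assuming these flags are available in the trtexec that ships with TensorRT 7.2:

$ trtexec --loadEngine=resnet34_peoplenet_pruned_int8.etlt_b1_gpu0_int8.engine --dumpProfile --duration=30

--dumpProfile prints per-layer timings, which shows where time goes inside the engine itself; --duration=30 runs the benchmark for at least 30 seconds for a more stable mean.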

I have attached the output from the following commands:

trtexec --loadEngine=resnet34_peoplenet_pruned_int8.etlt_b1_gpu0_int8.engine --int8 --calib=resnet34_peoplenet_pruned_int8_gpu.txt
trtexec --loadEngine=resnet34_peoplenet_pruned.etlt_b1_gpu0_fp16.engine --fp16

trtexec-int8.log (54.5 KB)
trtexec-fp16.log (35.7 KB)

From the log, the INT8 inference time (mean: 0.829047 ms) is much shorter than the FP16 inference time (mean: 1.32287 ms); at the engine level, int8 improves inference time by far more than 10%.
So I think the reason you only got a 10% improvement with INT8 is that inference is a small part of the whole pipeline. The time spent in the other stages, e.g. decoding and pre-/post-processing, does not change, so the end-to-end gain from INT8 is much smaller.
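
To make that concrete, a back-of-envelope calculation (the 5 ms total frame time is a hypothetical figure for illustration, not something measured here):

frame_time_fp16 = other_stages + infer_fp16 ≈ 3.68 ms + 1.32 ms = 5.00 ms
frame_time_int8 = other_stages + infer_int8 ≈ 3.68 ms + 0.83 ms = 4.51 ms
fps gain ≈ 5.00 / 4.51 - 1 ≈ 11%

The larger the non-inference share of the frame time, the smaller the end-to-end benefit of a faster engine.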

Could you increase the “batch-size”, e.g. batch-size=10, and check the fps of INT8 and FP16?

With batch size = 10 I see a much bigger difference in the TRT results.

int8 = 4.84358 ms mean
fp16 = 9.39056 ms mean
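
For scale, those engine times translate into the following engine-only throughputs (derived from the means above, ignoring everything else in the pipeline):

int8: 10 frames / 4.84358 ms ≈ 2065 fps
fp16: 10 frames / 9.39056 ms ≈ 1065 fps

Both are far above typical video frame rates, which is consistent with the pipeline, not the engine, being the limiter.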

However, when I run the actual DeepStream pipeline with 10 streams, the performance of fp16 and int8 is almost identical. This suggests you are correct and there is a bottleneck somewhere else in the pipeline, but I'm not sure where it could be. I have disabled everything except the source, the streammux, and the nvinfer element.

Here is my config for fp16:
nvidia-test-config.txt (3.6 KB)
config_infer_primary_peoplenet_fp16.txt (2.0 KB)

pgie config for int8:
config_infer_primary_peoplenet_int8.txt (2.1 KB)
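
Since the configs are attached rather than inlined, here is a sketch of the pieces that matter for a 10-stream test (an illustration of the standard deepstream-app multi-source pattern, not the attached files themselves; the URI is a placeholder):

[source0]
enable=1
# type=3 is a multi-URI source; num-sources replays the same URI as 10 streams
type=3
uri=file://../../streams/sample_1080p_h264.mp4
num-sources=10

[streammux]
# batch across all 10 streams so nvinfer can run with batch-size=10
batch-size=10
batched-push-timeout=40000

The pgie config would correspondingly set batch-size=10, so the engine runs at the batch size that was profiled with trtexec.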

Please replace “type=2” with “type=1” to let the pipeline free-run; otherwise, with “type=2”, the pipeline will always be synced by the display/EglSink to a fixed rate, e.g. 60 fps. (A corrected version of the block below is sketched afterwards.)

[sink0]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File
type=2
sync=1
source-id=0
gpu-id=0
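
Applied to the quoted block, the free-running sink would look like this (enable=1 and sync=0 are additions beyond the type change asked for above; sync=0 stops the sink from throttling buffers to the pipeline clock):

[sink0]
enable=1
#Type - 1=FakeSink 2=EglSink 3=File
type=1
sync=0
source-id=0
gpu-id=0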
