PeopleNet int8 shows small improvement over fp16

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: 5.1
• JetPack Version (valid for Jetson only):
• TensorRT Version: 7.2
• NVIDIA GPU Driver Version (valid for GPU only): 460.80
• Issue Type (questions, new requirements, bugs): Question
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the contents of the configuration files, the command line used, and other details for reproducing.)

I am running PeopleNet directly out of the 5.1-21.02-devel container using the included deepstream_app_source1_peoplenet.txt. On our test videos, the int8 quantized model yields only about a 10% fps increase over fp16 on a single stream. This seems low, but perhaps my expectations are incorrect - I was expecting a much bigger performance increase from int8. I have the following questions:

  1. Is this the expected performance increase for int8 over fp16 on a single stream? If not, how much should be expected?
  2. If this is the expected single-stream gain, can batching provide a bigger improvement? If so, how much?

I am including my pgie configuration for int8 below for diagnostic purposes. In the fp16 setting, I am using the default config_infer_primary_peoplenet.txt included in the container with the fp16 pruned model. For int8, I am using the config below with the quantized pruned model (the lines that differ from the fp16 config are sketched after the listing).

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
tlt-model-key=tlt_encode
tlt-encoded-model=../../../models/tlt_pretrained_models/peoplenet-int8/resnet34_peoplenet_pruned_int8.etlt
labelfile-path=../../../models/tlt_pretrained_models/peoplenet-int8/labels.txt
model-engine-file=../../../models/tlt_pretrained_models/peoplenet-int8/resnet34_peoplenet_pruned_int8.etlt_b1_gpu0_int8.engine
int8-calib-file=../../../models/tlt_pretrained_models/peoplenet-int8/resnet34_peoplenet_pruned_int8_gpu.txt
input-dims=3;544;960;0
uff-input-blob-name=input_1
batch-size=1
process-mode=1
model-color-format=0
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=1
num-detected-classes=3
cluster-mode=1
interval=0
gie-unique-id=1
output-blob-names=output_bbox/BiasAdd;output_cov/Sigmoid

[class-attrs-all]
pre-cluster-threshold=0.4
## Set eps=0.7 and minBoxes for cluster-mode=1(DBSCAN)
eps=0.7
minBoxes=1
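
For reference, the fp16 configuration should differ from the int8 one above only in the model files and network-mode. A minimal sketch of the lines that change (the fp16 file paths are assumptions based on the container's default layout; the engine name matches the one profiled later in this thread):

tlt-encoded-model=../../../models/tlt_pretrained_models/peoplenet/resnet34_peoplenet_pruned.etlt
model-engine-file=../../../models/tlt_pretrained_models/peoplenet/resnet34_peoplenet_pruned.etlt_b1_gpu0_fp16.engine
## no int8-calib-file is needed in fp16 mode
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2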

Hi @admayber,
Sorry for the delay!

What’s the GPU?

Could you use trtexec to profile the TensorRT engine generated by DeepStream, e.g. resnet34_peoplenet_pruned_int8.etlt_b1_gpu0_int8.engine?

Command:
$ trtexec --loadEngine=resnet34_peoplenet_pruned_int8.etlt_b1_gpu0_int8.engine

and share the log.
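
If per-layer numbers would help, trtexec can also dump a profile. A sketch, assuming these flags are available in the trtexec that ships with TensorRT 7.2:

$ trtexec --loadEngine=resnet34_peoplenet_pruned_int8.etlt_b1_gpu0_int8.engine --dumpProfile --duration=30

--dumpProfile prints per-layer timings, which shows where time goes inside the engine itself; --duration=30 runs the benchmark for at least 30 seconds for a more stable mean.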

I have attached the output from the following commands:

trtexec --loadEngine=resnet34_peoplenet_pruned_int8.etlt_b1_gpu0_int8.engine --int8 --calib=resnet34_peoplenet_pruned_int8_gpu.txt
trtexec --loadEngine=resnet34_peoplenet_pruned.etlt_b1_gpu0_fp16.engine --fp16

trtexec-int8.log (54.5 KB)
trtexec-fp16.log (35.7 KB)

From the log, the INT8 inference time (mean: 0.829047 ms) is much shorter than the FP16 inference time (mean: 1.32287 ms); at the engine level, int8 improves inference time by far more than 10%.
So I think the reason you only got a 10% improvement with INT8 is that inference is a small part of the whole pipeline. The time spent in the other stages, e.g. decoding and pre-/post-processing, does not change, so the end-to-end gain from INT8 is much smaller.
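
To make that concrete, a back-of-envelope calculation (the 5 ms total frame time is a hypothetical figure for illustration, not something measured here):

frame_time_fp16 = other_stages + infer_fp16 ≈ 3.68 ms + 1.32 ms = 5.00 ms
frame_time_int8 = other_stages + infer_int8 ≈ 3.68 ms + 0.83 ms = 4.51 ms
fps gain ≈ 5.00 / 4.51 - 1 ≈ 11%

The larger the non-inference share of the frame time, the smaller the end-to-end benefit of a faster engine.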

Could you increase the “batch-size”, e.g. batch-size=10, and check the fps of INT8 and FP16?

With batch size = 10 I see a much bigger difference in the TRT results.

int8 = 4.84358 ms mean
fp16 = 9.39056 ms mean
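
For scale, those engine times translate into the following engine-only throughputs (derived from the means above, ignoring everything else in the pipeline):

int8: 10 frames / 4.84358 ms ≈ 2065 fps
fp16: 10 frames / 9.39056 ms ≈ 1065 fps

Both are far above typical video frame rates, which is consistent with the pipeline, not the engine, being the limiter.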

However, when I run the actual DeepStream pipeline with 10 streams, the performance of fp16 and int8 is almost identical. This suggests you are correct and there is a bottleneck somewhere else in the pipeline, but I'm not sure where it could be. I have disabled everything except the source, the streammux, and the nvinfer element.

Here is my config for fp16:
nvidia-test-config.txt (3.6 KB)
config_infer_primary_peoplenet_fp16.txt (2.0 KB)

pgie config for int8:
config_infer_primary_peoplenet_int8.txt (2.1 KB)
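
Since the configs are attached rather than inlined, here is a sketch of the pieces that matter for a 10-stream test (an illustration of the standard deepstream-app multi-source pattern, not the attached files themselves; the URI is a placeholder):

[source0]
enable=1
# type=3 is a multi-URI source; num-sources replays the same URI as 10 streams
type=3
uri=file://../../streams/sample_1080p_h264.mp4
num-sources=10

[streammux]
# batch across all 10 streams so nvinfer can run with batch-size=10
batch-size=10
batched-push-timeout=40000

The pgie config would correspondingly set batch-size=10, so the engine runs at the batch size that was profiled with trtexec.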

Please replace “type=2” with “type=1” to let the pipeline free-run; otherwise, with “type=2”, the pipeline will always be synced by the display/EglSink to a fixed rate, e.g. 60 fps. (A corrected version of the block below is sketched afterwards.)

[sink0]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File
type=2
sync=1
source-id=0
gpu-id=0
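
Applied to the quoted block, the free-running sink would look like this (enable=1 and sync=0 are additions beyond the type change asked for above; sync=0 stops the sink from throttling buffers to the pipeline clock):

[sink0]
enable=1
#Type - 1=FakeSink 2=EglSink 3=File
type=1
sync=0
source-id=0
gpu-id=0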
