Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: 5.1
• JetPack Version (valid for Jetson only): N/A
• TensorRT Version: 7.2
• NVIDIA GPU Driver Version (valid for GPU only): 460.80
• Issue Type (questions, new requirements, bugs): Question
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used, and other details for reproducing.)
I am running PeopleNet directly out of the 5.1-21.02-devel container using the included deepstream_app_source1_peoplenet.txt (the launch commands are sketched after the questions below). On our test videos, we have found that using the INT8 quantized model instead of FP16 only yields about a 10% increase in fps on a single stream. This seems low, but perhaps my expectations are incorrect; I was expecting a much bigger performance increase from INT8. I have the following questions:
1. Is this the expected performance increase for FP16 vs INT8 on a single stream? If not, how much is expected?
2. If this is expected, is there a bigger performance increase to be had from batching? If so, how much?
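For reference, this is roughly how the app is launched; a minimal sketch assuming the standard DeepStream 5.1 container layout, with illustrative docker flags and paths:

```
# Start the DeepStream 5.1 devel container (flags are illustrative;
# adjust mounts/display forwarding to your environment)
docker run --gpus all -it --rm \
    -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY \
    nvcr.io/nvidia/deepstream:5.1-21.02-devel

# Inside the container: the TLT pretrained model configs ship with the samples
cd /opt/nvidia/deepstream/deepstream-5.1/samples/configs/tlt_pretrained_models
deepstream-app -c deepstream_app_source1_peoplenet.txt
```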
I am including my pgie configuration for INT8 below for diagnostic purposes. In the FP16 setting, I am using the default config_infer_primary_peoplenet.txt included in the container with the pruned FP16 model. For INT8, I am using the config below with the pruned, quantized model.
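The keys that actually differ between the FP16 and INT8 nvinfer configs are typically just these; a minimal sketch with placeholder file names, not the exact values from my setup:

```
[property]
# network-mode selects precision: 0=FP32, 1=INT8, 2=FP16
network-mode=1
# INT8 requires the calibration cache shipped with the model
int8-calib-file=../../models/tlt_pretrained_models/peoplenet/resnet34_peoplenet_int8.txt
tlt-encoded-model=../../models/tlt_pretrained_models/peoplenet/resnet34_peoplenet_pruned.etlt
tlt-model-key=tlt_encode
batch-size=1
```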
From the log, the INT8 inference time (mean: 0.829047 ms) is much shorter than the FP16 inference time (mean: 1.32287 ms), so the inference time itself improves by far more than 10%.
So I think the reason you only got a 10% improvement with INT8 is that inference is only a small part of the whole pipeline; the time of the other parts, e.g. decoding and pre-/post-processing, does not change, so the end-to-end improvement from INT8 is smaller. For example, if the non-inference stages take about 10 ms per frame, cutting inference from 1.32 ms to 0.83 ms only shrinks the per-frame total from roughly 11.3 ms to 10.8 ms, i.e. a few percent.
Could you increase the batch size, e.g. batch-size=10, and check the fps of INT8 and FP16?
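Note that batch-size appears in two places: the streammux group of the app config and the [property] group of the nvinfer config. A minimal sketch of both changes:

```
# deepstream_app_source1_peoplenet.txt
[streammux]
batch-size=10

# config_infer_primary_peoplenet.txt
[property]
batch-size=10
```

If a cached engine file was built for batch size 1, nvinfer should rebuild the TensorRT engine on the first run with the new batch size.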
With batch size = 10 I see a much bigger difference in the TensorRT results:
INT8 = 4.84358 ms (mean)
FP16 = 9.39056 ms (mean)
However, when I run the actual DeepStream pipeline with 10 streams, the performance of FP16 and INT8 is almost identical, with no difference at all. This indicates that you are correct and there is probably some bottleneck in the pipeline, but I'm not sure where it could be. I have disabled everything except the source, the streammux, and the nvinfer element.
Please replace "type=2" with "type=1" in the sink group to let the pipeline free-run; with "type=2", the pipeline is always synced by the display/EglSink to a fixed frequency, e.g. 60fps.
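Concretely, in the app config's sink group; a sketch, where sync=0 is an additional suggestion beyond the type change so buffers are not throttled to the pipeline clock:

```
# deepstream_app_source1_peoplenet.txt
[sink0]
enable=1
# type=1 selects FakeSink (no rendering); type=2 is EglSink, which paces
# the whole pipeline at the display refresh rate (e.g. 60 fps)
type=1
# don't sync buffers to the clock; consume them as fast as they arrive
sync=0
```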