DeepStream 5 vs 6: inference time and FPS calculation in the pipeline on Jetson Nano

Hi, my previous setup is described in my earlier topic: DeepStream SDK: How to use NvDsInferNetworkInfo get network input shape in Python.

My current setup is:

• Hardware Platform (Jetson Nano)
• DeepStream Version 6.0
• JetPack Version (valid for Jetson only) 4.6
• TensorRT Version 8.0.1

I am using a pipeline in DeepStream and followed How to check inference time for a frame when using deepstream to measure the inference time of a frame.
On DeepStream 5, YOLOv3 inference takes 6~10 ms and the whole pipeline runs at 2~28 fps.
On DeepStream 6, YOLOv3 inference takes 200~220 ms and the whole pipeline runs at about 4 fps.
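
For context, the "One Frame Cost" / "FPS=" numbers below come from my own sink-side timer; a simplified sketch of that kind of per-frame timing (written here as a buffer probe, while my actual measurement sits in the sink while loop; names are placeholders, not my exact code):

import time
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

# Simplified sketch: time the wall-clock gap between consecutive buffers
# arriving at the sink and print the same "One Frame Cost" / "FPS=" values.
last_time = [None]

def sink_pad_buffer_probe(pad, info, u_data):
    now = time.perf_counter()
    if last_time[0] is not None:
        cost = now - last_time[0]          # seconds since the previous frame
        print("One Frame Cost %.3f s" % cost)
        print("FPS=", 1.0 / cost)
    last_time[0] = now
    return Gst.PadProbeReturn.OK

# Attached once the pipeline is built, e.g.:
#   sinkpad = sink.get_static_pad("sink")
#   sinkpad.add_probe(Gst.PadProbeType.BUFFER, sink_pad_buffer_probe, 0)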

My output on DeepStream 6 looks like this:

measure inference time for a frame

No. 1
time of infer takes: 11405 us

measure inference time for a frame
measure inference time for a frame

No. 2
time of infer takes: 5210 us

measure inference time for a frame
measure inference time for a frame

No. 3
time of infer takes: 5570 us

measure inference time for a frame
measure inference time for a frame

No. 4
time of infer takes: 187192 us

measure inference time for a frame
measure on sink while loop

One Frame Cost 282.562 s
FPS= 0.003539048751972224

measure on sink while loop
measure inference time for a frame

time of infer takes: 6475 us (very fast)

measure inference time for a frame
measure on sink while loop

One Frame Cost 0.260 s
FPS= 3.839397750719034

measure on sink while loop

time of infer takes: 7896 us (very fast)
One Frame Cost 0.261 s
FPS= 3.8267486216400908
time of infer takes: 5789 us (very fast)
One Frame Cost 0.259 s
FPS= 3.854329051668526
time of infer takes: 207265 us
One Frame Cost 0.236 s
FPS= 4.231370808280538
time of infer takes: 217379 us
One Frame Cost 0.254 s
FPS= 3.9308247479684244
time of infer takes: 212361 us
One Frame Cost 0.253 s
FPS= 3.9527439139318
time of infer takes: 218917 us
One Frame Cost 0.255 s
FPS= 3.923030372754406

So, I want to ask you some questions.

  1. Why are the first three frames never sent to the Python API, with only the inference time being measured?
  2. On DeepStream 5, am I measuring only the inference time or the cost of the whole pipeline? I expected the average time (about 7 ms, averaged over 100 frames) to be very fast, but I found it very unstable.
  3. Although it appears stable on DeepStream 6, sometimes the measured inference time is very fast (5~8 ms). Why does that happen?

Thanks.

I am noticing slow FPS with YOLOv3 on a Tesla T4, probably related: Yolo V3 slow - #2 by mfoglio

  • Why are the first three frames never sent to the Python API, with only the inference time being measured?

    → Can you specify which app you are using?

  • On DeepStream 5, am I measuring only the inference time or the cost of the whole pipeline? I expected the average time (about 7 ms, averaged over 100 frames) to be very fast, but I found it very unstable.

    → When measuring performance, boost the clocks to make sure you get stable data:
    sudo nvpmodel -m 0   # you can get the model level from /etc/nvpmodel.conf
    sudo jetson_clocks

  • Although it appears stable on DeepStream 6, sometimes the measured inference time is very fast (5~8 ms). Why does that happen?

    → Do you mean you get a big performance difference for YOLOv3 between DS5 and DS6?
    If yes, refer to this post about low YOLO performance on DS6; there is one fix, see comment 22:
    Deepstream 6 YOLO performance issue - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums

  • Can you specify which app you are using?
    → I referenced deepstream_python_apps/apps/deepstream-ssd-parser and use custom Python code.
    I output tensor data from a custom YOLOv3 ONNX model via DeepStream (nvinfer) and do the post-processing in Python; a trimmed-down sketch of that probe is at the end of this reply.

  • When measuring performance, boost the clocks to make sure you get stable data.
    sudo nvpmodel -m 0   # you can get the model level from /etc/nvpmodel.conf
    sudo jetson_clocks
    → I tried that, but the issue remains the same.

  • Do you mean you get a big performance difference for YOLOv3 between DS5 and DS6?
    If yes, refer to this post about low YOLO performance on DS6; there is one fix, see comment 22:
    Deepstream 6 YOLO performance issue - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums
    → I mean the same engine shows a large variation in inference rate within nvinfer.
    Although I do get a big performance difference for YOLOv3 between DS5 and DS6, it was very unstable on DS5,
    so I cannot really compare DS5 (always a big gap) with DS6 (sometimes a big gap) at the moment.
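
As mentioned above, the probe that pulls the raw YOLOv3 output tensors follows the deepstream-ssd-parser pattern; a trimmed-down sketch (the actual YOLO decoding is only a placeholder here, parse_yolov3_outputs, and error handling is shortened):

import pyds
from gi.repository import Gst

def parse_yolov3_outputs(layers, frame_meta):
    # Placeholder for the custom YOLOv3 decoding (boxes, score filtering, NMS).
    pass

def pgie_src_pad_buffer_probe(pad, info, u_data):
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        l_user = frame_meta.frame_user_meta_list
        while l_user is not None:
            user_meta = pyds.NvDsUserMeta.cast(l_user.data)
            if user_meta.base_meta.meta_type == pyds.NvDsMetaType.NVDSINFER_TENSOR_OUTPUT_META:
                tensor_meta = pyds.NvDsInferTensorMeta.cast(user_meta.user_meta_data)
                # Collect the output layer descriptors (name, dims, host buffer)
                # and hand them to the custom YOLOv3 decoder.
                layers = [pyds.get_nvds_LayerInfo(tensor_meta, i)
                          for i in range(tensor_meta.num_output_layers)]
                parse_yolov3_outputs(layers, frame_meta)
            try:
                l_user = l_user.next
            except StopIteration:
                break
        try:
            l_frame = l_frame.next
        except StopIteration:
            break
    return Gst.PadProbeReturn.OK

This relies on output-tensor-meta=1 being set in the nvinfer config so the raw output tensors are attached as NVDSINFER_TENSOR_OUTPUT_META.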

  • Can you specify which app you are using?
    → I referenced deepstream_python_apps/apps/deepstream-ssd-parser and use custom Python code.
    I output tensor data from a custom YOLOv3 ONNX model via DeepStream (nvinfer) and do the post-processing in Python.

    [amycao] DeepStream runs in asynchronous mode, so it is expected that you do not get the first three frames in the probe.
  • When measuring performance, boost the clocks to make sure you get stable data.
    sudo nvpmodel -m 0   # you can get the model level from /etc/nvpmodel.conf
    sudo jetson_clocks
    → I tried that, but the issue remains the same.
  • Do you mean you get a big performance difference for YOLOv3 between DS5 and DS6?
    If yes, refer to this post about low YOLO performance on DS6; there is one fix, see comment 22:
    Deepstream 6 YOLO performance issue - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums
    → I mean the same engine shows a large variation in inference rate within nvinfer.
    Although I do get a big performance difference for YOLOv3 between DS5 and DS6, it was very unstable on DS5,
    so I cannot really compare DS5 (always a big gap) with DS6 (sometimes a big gap) at the moment.

    [amycao] It may not be appropriate to treat the run time of m_BackendContext->enqueueBuffer(backendBuffer,
    *m_InferStream, m_InputConsumedEvent.get()) as the inference time. Since inference runs in a different CUDA stream, the call is asynchronous and inference may not have finished when it returns; I think that is why you see the deviation in the measured inference time. You should use trtexec to get the inference time; you can find it under /usr/src/tensorrt/bin/.
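
To illustrate that point in Python (using the TensorRT Python API rather than the nvinfer C++ code; context, bindings and stream are assumed to be a tensorrt.IExecutionContext, the device buffer pointers and a pycuda.driver.Stream set up elsewhere):

import time

def time_enqueue_only(context, bindings, stream):
    # Timing only the enqueue call: it returns as soon as the work is queued
    # on the CUDA stream, so this mostly captures launch overhead, not the
    # actual inference, and can look misleadingly fast.
    t0 = time.perf_counter()
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    return (time.perf_counter() - t0) * 1e3   # milliseconds

def time_with_sync(context, bindings, stream):
    # Synchronizing the stream before stopping the clock waits for the GPU to
    # finish, which is close to what trtexec reports as GPU Compute Time.
    t0 = time.perf_counter()
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    stream.synchronize()
    return (time.perf_counter() - t0) * 1e3   # milliseconds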

Hi,

Thanks for the reply.
When I use the command /usr/src/tensorrt/bin/trtexec --onnx=./yolov3-512.onnx --batch=1 --saveEngine=test.engine --fp16 --verbose, please refer to the attached file build.log (782.6 KB) for the output.
I want to ask: where is the inference time in this log?

Thanks.

Please remove the --verbose option and run again.
You will see a log like the one below; use the GPU Compute Time median value as the inference time.
[12/17/2021-09:23:59] [I] === Performance summary ===
[12/17/2021-09:23:59] [I] Throughput: 592.779 qps
[12/17/2021-09:23:59] [I] Latency: min = 1.44849 ms, max = 5.65085 ms, mean = 1.67221 ms, median = 1.46509 ms, percentile(99%) = 4.15195 ms
[12/17/2021-09:23:59] [I] End-to-End Host Latency: min = 1.45703 ms, max = 9.76761 ms, mean = 1.68689 ms, median = 1.47607 ms, percentile(99%) = 4.1666 ms
[12/17/2021-09:23:59] [I] Enqueue Time: min = 0.324463 ms, max = 0.799561 ms, mean = 0.361476 ms, median = 0.349609 ms, percentile(99%) = 0.50943 ms
[12/17/2021-09:23:59] [I] H2D Latency: min = 0.11377 ms, max = 0.377777 ms, mean = 0.131549 ms, median = 0.115112 ms, percentile(99%) = 0.322327 ms
[12/17/2021-09:23:59] [I] GPU Compute Time: min = 1.32703 ms, max = 5.31146 ms, mean = 1.5326 ms, median = 1.34253 ms, percentile(99%) = 3.81232 ms
[12/17/2021-09:23:59] [I] D2H Latency: min = 0.0065918 ms, max = 0.0196533 ms, mean = 0.00805794 ms, median = 0.00732422 ms, percentile(99%) = 0.0194092 ms
[12/17/2021-09:23:59] [I] Total Host Walltime: 3.00112 s
[12/17/2021-09:23:59] [I] Total GPU Compute Time: 2.72649 s
[12/17/2021-09:23:59] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/17/2021-09:23:59] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --loadEngine=samples/models/Primary_Detector/resnet10.caffemodel_b4_gpu0_int8.engine
[12/17/2021-09:23:59] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 905, GPU 15312 (MiB)
Developer Guide :: NVIDIA Deep Learning TensorRT Documentation
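
If it helps, the median can also be pulled out of the trtexec output programmatically; a small sketch (paths and flags are only examples):

import re
import subprocess

def gpu_compute_median_ms(engine_path, trtexec="/usr/src/tensorrt/bin/trtexec"):
    # Run trtexec on an already-built engine and parse the "GPU Compute Time"
    # median from the performance summary printed at the end.
    proc = subprocess.run([trtexec, "--loadEngine=" + engine_path],
                          stdout=subprocess.PIPE, universal_newlines=True)
    m = re.search(r"GPU Compute Time:.*?median = ([0-9.]+) ms", proc.stdout)
    return float(m.group(1)) if m else None

# Example:
#   print(gpu_compute_median_ms("test.engine"))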

Hi,

I’m sorry for the late response.
Thanks for the reply.
After using the method above, I now get a stable inference time.
It seems that the inference time is simply limited by the Jetson Nano hardware.
From my understanding, I could also replace TensorRT with a model compiled by TVM or Glow.
Is my understanding correct?

Thanks.

You can set interval in nvinfer to skip batch processing.
interval: Specifies the number of consecutive batches to be skipped for inference.
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvinfer.html
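
A minimal sketch of setting this from Python (assuming the nvinfer element is created with Gst.ElementFactory.make as in the deepstream_python_apps samples; the same key can go as interval=1 under [property] in the nvinfer config file):

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
# Run inference on one batch, then skip this many consecutive batches.
pgie.set_property("interval", 1)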
