DeepStream YOLOv4 is slow when processing multiple streams

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
Jetson Xavier NX, Module: P3668, Board: P3509-000
• DeepStream Version
DeepStreamSDK 5.0.0
• JetPack Version (valid for Jetson only)
R32 (release), REVISION: 4.4
• TensorRT Version
TensorRT Version: 7.1
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type( questions, new requirements, bugs)
question
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
Use my project with deepstream-app.
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

Hi, I am using DeepStream with YOLOv4 to detect people. I can now run the converted YOLOv4 engine. I have two questions:
1- With an input size of 1 * 3 * 608 * 608 (batch size 1), I get 20 fps according to the deepstream-app perf output. I think 20 fps is too slow. Is this inference time normal? The trtexec output for the engine is below.
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=./yolov4_1_3_608_608_static.engine
[11/02/2021-17:58:24] [I] === Model Options ===
[11/02/2021-17:58:24] [I] Format: *
[11/02/2021-17:58:24] [I] Model:
[11/02/2021-17:58:24] [I] Output:
[11/02/2021-17:58:24] [I] === Build Options ===
[11/02/2021-17:58:24] [I] Max batch: 1
[11/02/2021-17:58:24] [I] Workspace: 16 MB
[11/02/2021-17:58:24] [I] minTiming: 1
[11/02/2021-17:58:24] [I] avgTiming: 8
[11/02/2021-17:58:24] [I] Precision: FP32
[11/02/2021-17:58:24] [I] Calibration:
[11/02/2021-17:58:24] [I] Safe mode: Disabled
[11/02/2021-17:58:24] [I] Save engine:
[11/02/2021-17:58:24] [I] Load engine: ./yolov4_1_3_608_608_static.engine
[11/02/2021-17:58:24] [I] Builder Cache: Enabled
[11/02/2021-17:58:24] [I] NVTX verbosity: 0
[11/02/2021-17:58:24] [I] Inputs format: fp32:CHW
[11/02/2021-17:58:24] [I] Outputs format: fp32:CHW
[11/02/2021-17:58:24] [I] Input build shapes: model
[11/02/2021-17:58:24] [I] Input calibration shapes: model
[11/02/2021-17:58:24] [I] === System Options ===
[11/02/2021-17:58:24] [I] Device: 0
[11/02/2021-17:58:24] [I] DLACore:
[11/02/2021-17:58:24] [I] Plugins:
[11/02/2021-17:58:24] [I] === Inference Options ===
[11/02/2021-17:58:24] [I] Batch: 1
[11/02/2021-17:58:24] [I] Input inference shapes: model
[11/02/2021-17:58:24] [I] Iterations: 10
[11/02/2021-17:58:24] [I] Duration: 3s (+ 200ms warm up)
[11/02/2021-17:58:24] [I] Sleep time: 0ms
[11/02/2021-17:58:24] [I] Streams: 1
[11/02/2021-17:58:24] [I] ExposeDMA: Disabled
[11/02/2021-17:58:24] [I] Spin-wait: Disabled
[11/02/2021-17:58:24] [I] Multithreading: Disabled
[11/02/2021-17:58:24] [I] CUDA Graph: Disabled
[11/02/2021-17:58:24] [I] Skip inference: Disabled
[11/02/2021-17:58:24] [I] Inputs:
[11/02/2021-17:58:24] [I] === Reporting Options ===
[11/02/2021-17:58:24] [I] Verbose: Disabled
[11/02/2021-17:58:24] [I] Averages: 10 inferences
[11/02/2021-17:58:24] [I] Percentile: 99
[11/02/2021-17:58:24] [I] Dump output: Disabled
[11/02/2021-17:58:24] [I] Profile: Disabled
[11/02/2021-17:58:24] [I] Export timing to JSON file:
[11/02/2021-17:58:24] [I] Export output to JSON file:
[11/02/2021-17:58:24] [I] Export profile to JSON file:
[11/02/2021-17:58:24] [I]
[11/02/2021-17:58:28] [I] Starting inference threads
[11/02/2021-17:58:31] [I] Warmup completed 4 queries over 200 ms
[11/02/2021-17:58:31] [I] Timing trace has 61 queries over 3.12094 s
[11/02/2021-17:58:31] [I] Trace averages of 10 runs:
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.5975 ms - Host latency: 51.1469 ms (end to end 51.1589 ms, enqueue 42.815 ms)
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.5798 ms - Host latency: 51.1292 ms (end to end 51.1398 ms, enqueue 42.5293 ms)
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.5728 ms - Host latency: 51.1222 ms (end to end 51.1314 ms, enqueue 42.1352 ms)
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.5992 ms - Host latency: 51.1473 ms (end to end 51.1577 ms, enqueue 42.9337 ms)
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.6121 ms - Host latency: 51.1613 ms (end to end 51.1721 ms, enqueue 42.3076 ms)
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.5814 ms - Host latency: 51.1295 ms (end to end 51.1411 ms, enqueue 42.4885 ms)
[11/02/2021-17:58:31] [I] Host Latency
[11/02/2021-17:58:31] [I] min: 51.0309 ms (end to end 51.0417 ms)
[11/02/2021-17:58:31] [I] max: 51.8796 ms (end to end 51.8843 ms)
[11/02/2021-17:58:31] [I] mean: 51.1515 ms (end to end 51.1622 ms)
[11/02/2021-17:58:31] [I] median: 51.1344 ms (end to end 51.1456 ms)
[11/02/2021-17:58:31] [I] percentile: 51.8796 ms at 99% (end to end 51.8843 ms at 99%)
[11/02/2021-17:58:31] [I] throughput: 19.5454 qps
[11/02/2021-17:58:31] [I] walltime: 3.12094 s
[11/02/2021-17:58:31] [I] Enqueue Time
[11/02/2021-17:58:31] [I] min: 36.1257 ms
[11/02/2021-17:58:31] [I] max: 43.0369 ms
[11/02/2021-17:58:31] [I] median: 42.8408 ms
[11/02/2021-17:58:31] [I] GPU Compute
[11/02/2021-17:58:31] [I] min: 50.4784 ms
[11/02/2021-17:58:31] [I] max: 51.3586 ms
[11/02/2021-17:58:31] [I] mean: 50.6031 ms
[11/02/2021-17:58:31] [I] median: 50.5873 ms
[11/02/2021-17:58:31] [I] percentile: 51.3586 ms at 99%
[11/02/2021-17:58:31] [I] total compute time: 3.08679 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=./yolov4_1_3_608_608_static.engine

2- I want to run inference on multiple streams. I changed batch-size to N, but the fps dropped to 20/N for every stream. I thought increasing batch-size would improve fps. Is that right? See the figure below:
image

Here are the logs after I changed batch-size to 8 and used the deepstream-app perf output to view fps.
H264: Profile = 66, Level = 0
**PERF: 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00)
**PERF: 2.13 (1.75) 2.22 (1.52) 2.13 (1.75) 2.22 (1.52) 2.13 (1.75) 2.13 (1.75) 2.22 (1.52) 2.22 (1.52)
**PERF: 2.23 (1.86) 2.23 (1.81) 2.23 (1.86) 2.23 (1.81) 2.23 (1.86) 2.23 (1.86) 2.23 (1.81) 2.23 (1.81)
**PERF: 2.23 (1.91) 2.23 (1.88) 2.23 (1.91) 2.23 (1.88) 2.23 (1.91) 2.23 (1.91) 2.23 (1.88) 2.23 (1.88)
**PERF: 2.23 (2.17) 2.23 (2.19) 2.23 (2.17) 2.23 (2.19) 2.23 (2.17) 2.23 (2.17) 2.23 (2.19) 2.23 (2.19)
**PERF: 2.23 (2.14) 2.23 (2.15) 2.23 (2.14) 2.23 (2.15) 2.23 (2.14) 2.23 (2.14) 2.23 (2.15) 2.23 (2.15)
**PERF: 2.23 (2.11) 2.23 (2.12) 2.23 (2.11) 2.23 (2.12) 2.23 (2.11) 2.23 (2.11) 2.23 (2.12) 2.23 (2.12)

How can I improve fps? I need at least 4~8 streams, but the fps is currently too low. I know that reducing the input size (1 * 3 * 608 * 608) would help, but I want to know whether there are other ways to improve fps.

From this, 20fps is expected. Did you use INT8?
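As a rough check of that statement, the trtexec numbers can be converted to a frame rate. The sketch below is only illustrative: it assumes one batch-1 inference runs at a time, and the function name `fps_from_latency` is made up for this example. The values are copied from the log in the question.

```python
# Values copied from the trtexec log in the question.
MEAN_HOST_LATENCY_MS = 51.1515  # "Host Latency ... mean"
QUERIES = 61                    # "Timing trace has 61 queries"
WALLTIME_S = 3.12094            # "walltime"


def fps_from_latency(latency_ms: float) -> float:
    """Frames per second if one batch-1 inference runs at a time (assumption)."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Estimate from the mean per-inference latency.
fps_latency = fps_from_latency(MEAN_HOST_LATENCY_MS)

# Estimate from the number of completed queries over the wall time.
fps_trace = QUERIES / WALLTIME_S

print(f"from mean host latency: {fps_latency:.2f} fps")
print(f"from timing trace:      {fps_trace:.2f} qps")
```

Both estimates land close to the 19.5454 qps that trtexec reports, so the roughly 20 fps seen in deepstream-app is consistent with the engine's own inference time.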

Whether increasing the batch size can improve total fps depends on whether the GPU is already fully used with the current batch size.
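A minimal sketch of what this means for the numbers in the question, under the simplifying assumption that the GPU is already saturated at batch size 1, so the total throughput stays fixed and is shared evenly between streams. The helper `per_stream_fps` is hypothetical and only does the arithmetic; it does not model DeepStream itself.

```python
def per_stream_fps(total_fps: float, num_streams: int) -> float:
    """Per-stream rate when a fixed total throughput is shared evenly (assumption)."""
    if num_streams < 1:
        raise ValueError("num_streams must be at least 1")
    return total_fps / num_streams


SINGLE_STREAM_FPS = 20.0     # reported in the question for batch size 1
OBSERVED_PER_STREAM = 2.23   # from the **PERF lines with 8 streams
NUM_STREAMS = 8

# What each stream would get if the total stayed at 20 fps.
expected_per_stream = per_stream_fps(SINGLE_STREAM_FPS, NUM_STREAMS)

# Total throughput actually observed across all 8 streams.
observed_total = OBSERVED_PER_STREAM * NUM_STREAMS

print(f"expected per stream: {expected_per_stream:.2f} fps")
print(f"observed total:      {observed_total:.2f} fps")
```

The observed total stays in the same range as the single-stream rate, which matches the 20/N behaviour described in the question: batching spreads the available throughput over the streams rather than adding to it.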

I use FP16, and the GPU is fully used. Is 20 fps normal? Can you test my use case on your NX board?

I downloaded yolov4.weights from https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights.

Can you help me test how many fps you get on your Xavier NX board?

Can you please provide detailed instructions for a quick review?
Also, did you boost the NX to MAX-N?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Yes, I have boosted the NX to MAX-N.

I followed this post step by step:
yolov4_deepstream/deepstream_yolov4 at master · NVIDIA-AI-IOT/yolov4_deepstream · GitHub.
The only change I made was setting the input size to 608 * 608.
I also found some perf data in this thread: What kind of hardware rigs can support 100+ videos analytics using deepstream?
The picture below shows the YOLOv4 perf data on Xavier AGX. I would like to know the Xavier NX perf data.
image

The NX GPU compute capacity is ~0.57 of the Xavier AGX GPU, and the YOLOv4 input resolution of 416 x 416 is ~0.47 of your YOLO input (608 x 608), so the fps you got is expected:

60 fps (Xavier, YOLOv4 416 x 416) * ~0.57 * ~0.47 = ~16 fps
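The estimate can be reproduced with a few lines of arithmetic. This is only a sketch of the reasoning in the reply, using the 608 x 608 input size stated in the question. It assumes that inference cost scales linearly with the number of input pixels and with the quoted ~0.57 GPU ratio; the function name `pixel_ratio` is made up for this example.

```python
AGX_FPS_AT_416 = 60.0        # Xavier AGX figure quoted in the reply
NX_TO_AGX_GPU_RATIO = 0.57   # approximate ratio quoted in the reply


def pixel_ratio(small_side: int, large_side: int) -> float:
    """Ratio of pixel counts between two square input resolutions."""
    return (small_side * small_side) / (large_side * large_side)


# 416 x 416 compared with the 608 x 608 input used in the question.
resolution_ratio = pixel_ratio(416, 608)

# Assumes fps scales linearly with both ratios, which is a simplification.
estimated_nx_fps = AGX_FPS_AT_416 * NX_TO_AGX_GPU_RATIO * resolution_ratio

print(f"resolution ratio: {resolution_ratio:.3f}")
print(f"estimated NX fps: {estimated_nx_fps:.1f}")
```

The result is in the same range as the roughly 20 fps measured on the NX, which is why the reply treats the measured rate as expected.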
