Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU)
Jetson Xavier NX (Module: P3668, Board: P3509-000)
• DeepStream Version
DeepStreamSDK 5.0.0
• JetPack Version (valid for Jetson only)
L4T R32.4.4
• TensorRT Version
TensorRT Version: 7.1
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type( questions, new requirements, bugs)
question
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
Using my own project and the deepstream-app sample.
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)
Hi, I am using DeepStream with YOLOv4 to detect persons. I can now run the converted YOLOv4 engine, and I have two questions:
1- With an input size of 1 * 3 * 608 * 608 (batch size 1), deepstream-app's perf measurement shows about 20 fps, which seems too slow to me. Is this a normal inference time for this setup? Here is the trtexec output for the engine:
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=./yolov4_1_3_608_608_static.engine
[11/02/2021-17:58:24] [I] === Model Options ===
[11/02/2021-17:58:24] [I] Format: *
[11/02/2021-17:58:24] [I] Model:
[11/02/2021-17:58:24] [I] Output:
[11/02/2021-17:58:24] [I] === Build Options ===
[11/02/2021-17:58:24] [I] Max batch: 1
[11/02/2021-17:58:24] [I] Workspace: 16 MB
[11/02/2021-17:58:24] [I] minTiming: 1
[11/02/2021-17:58:24] [I] avgTiming: 8
[11/02/2021-17:58:24] [I] Precision: FP32
[11/02/2021-17:58:24] [I] Calibration:
[11/02/2021-17:58:24] [I] Safe mode: Disabled
[11/02/2021-17:58:24] [I] Save engine:
[11/02/2021-17:58:24] [I] Load engine: ./yolov4_1_3_608_608_static.engine
[11/02/2021-17:58:24] [I] Builder Cache: Enabled
[11/02/2021-17:58:24] [I] NVTX verbosity: 0
[11/02/2021-17:58:24] [I] Inputs format: fp32:CHW
[11/02/2021-17:58:24] [I] Outputs format: fp32:CHW
[11/02/2021-17:58:24] [I] Input build shapes: model
[11/02/2021-17:58:24] [I] Input calibration shapes: model
[11/02/2021-17:58:24] [I] === System Options ===
[11/02/2021-17:58:24] [I] Device: 0
[11/02/2021-17:58:24] [I] DLACore:
[11/02/2021-17:58:24] [I] Plugins:
[11/02/2021-17:58:24] [I] === Inference Options ===
[11/02/2021-17:58:24] [I] Batch: 1
[11/02/2021-17:58:24] [I] Input inference shapes: model
[11/02/2021-17:58:24] [I] Iterations: 10
[11/02/2021-17:58:24] [I] Duration: 3s (+ 200ms warm up)
[11/02/2021-17:58:24] [I] Sleep time: 0ms
[11/02/2021-17:58:24] [I] Streams: 1
[11/02/2021-17:58:24] [I] ExposeDMA: Disabled
[11/02/2021-17:58:24] [I] Spin-wait: Disabled
[11/02/2021-17:58:24] [I] Multithreading: Disabled
[11/02/2021-17:58:24] [I] CUDA Graph: Disabled
[11/02/2021-17:58:24] [I] Skip inference: Disabled
[11/02/2021-17:58:24] [I] Inputs:
[11/02/2021-17:58:24] [I] === Reporting Options ===
[11/02/2021-17:58:24] [I] Verbose: Disabled
[11/02/2021-17:58:24] [I] Averages: 10 inferences
[11/02/2021-17:58:24] [I] Percentile: 99
[11/02/2021-17:58:24] [I] Dump output: Disabled
[11/02/2021-17:58:24] [I] Profile: Disabled
[11/02/2021-17:58:24] [I] Export timing to JSON file:
[11/02/2021-17:58:24] [I] Export output to JSON file:
[11/02/2021-17:58:24] [I] Export profile to JSON file:
[11/02/2021-17:58:24] [I]
[11/02/2021-17:58:28] [I] Starting inference threads
[11/02/2021-17:58:31] [I] Warmup completed 4 queries over 200 ms
[11/02/2021-17:58:31] [I] Timing trace has 61 queries over 3.12094 s
[11/02/2021-17:58:31] [I] Trace averages of 10 runs:
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.5975 ms - Host latency: 51.1469 ms (end to end 51.1589 ms, enqueue 42.815 ms)
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.5798 ms - Host latency: 51.1292 ms (end to end 51.1398 ms, enqueue 42.5293 ms)
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.5728 ms - Host latency: 51.1222 ms (end to end 51.1314 ms, enqueue 42.1352 ms)
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.5992 ms - Host latency: 51.1473 ms (end to end 51.1577 ms, enqueue 42.9337 ms)
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.6121 ms - Host latency: 51.1613 ms (end to end 51.1721 ms, enqueue 42.3076 ms)
[11/02/2021-17:58:31] [I] Average on 10 runs - GPU latency: 50.5814 ms - Host latency: 51.1295 ms (end to end 51.1411 ms, enqueue 42.4885 ms)
[11/02/2021-17:58:31] [I] Host Latency
[11/02/2021-17:58:31] [I] min: 51.0309 ms (end to end 51.0417 ms)
[11/02/2021-17:58:31] [I] max: 51.8796 ms (end to end 51.8843 ms)
[11/02/2021-17:58:31] [I] mean: 51.1515 ms (end to end 51.1622 ms)
[11/02/2021-17:58:31] [I] median: 51.1344 ms (end to end 51.1456 ms)
[11/02/2021-17:58:31] [I] percentile: 51.8796 ms at 99% (end to end 51.8843 ms at 99%)
[11/02/2021-17:58:31] [I] throughput: 19.5454 qps
[11/02/2021-17:58:31] [I] walltime: 3.12094 s
[11/02/2021-17:58:31] [I] Enqueue Time
[11/02/2021-17:58:31] [I] min: 36.1257 ms
[11/02/2021-17:58:31] [I] max: 43.0369 ms
[11/02/2021-17:58:31] [I] median: 42.8408 ms
[11/02/2021-17:58:31] [I] GPU Compute
[11/02/2021-17:58:31] [I] min: 50.4784 ms
[11/02/2021-17:58:31] [I] max: 51.3586 ms
[11/02/2021-17:58:31] [I] mean: 50.6031 ms
[11/02/2021-17:58:31] [I] median: 50.5873 ms
[11/02/2021-17:58:31] [I] percentile: 51.3586 ms at 99%
[11/02/2021-17:58:31] [I] total compute time: 3.08679 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=./yolov4_1_3_608_608_static.engine
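One thing I am considering for question 1 is rebuilding the engine in FP16 instead of FP32 (trtexec reports "Precision: FP32" above), to check whether precision is the bottleneck. A rough sketch of what I would run, assuming the ONNX file from my conversion step (file names here are just placeholders from my setup):

# Rebuild the engine with FP16 layers allowed (file names are placeholders)
/usr/src/tensorrt/bin/trtexec --onnx=yolov4_1_3_608_608_static.onnx --fp16 --workspace=1024 --saveEngine=yolov4_1_3_608_608_fp16.engine
# Benchmark the new engine the same way as above
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_1_3_608_608_fp16.engine

Would FP16 (or INT8 with calibration) be the expected way to get a larger speedup here, or is ~50 ms per 608x608 FP32 inference simply what the Xavier NX can do?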
2- I want to run inference on multiple streams, so I changed batch-size to N, but the fps per stream dropped to roughly 20/N. I thought increasing batch-size would improve the overall fps; is that right? Below are the deepstream-app perf logs with batch-size set to 8:
H264: Profile = 66, Level = 0
**PERF: 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00)
**PERF: 2.13 (1.75) 2.22 (1.52) 2.13 (1.75) 2.22 (1.52) 2.13 (1.75) 2.13 (1.75) 2.22 (1.52) 2.22 (1.52)
**PERF: 2.23 (1.86) 2.23 (1.81) 2.23 (1.86) 2.23 (1.81) 2.23 (1.86) 2.23 (1.86) 2.23 (1.81) 2.23 (1.81)
**PERF: 2.23 (1.91) 2.23 (1.88) 2.23 (1.91) 2.23 (1.88) 2.23 (1.91) 2.23 (1.91) 2.23 (1.88) 2.23 (1.88)
**PERF: 2.23 (2.17) 2.23 (2.19) 2.23 (2.17) 2.23 (2.19) 2.23 (2.17) 2.23 (2.17) 2.23 (2.19) 2.23 (2.19)
**PERF: 2.23 (2.14) 2.23 (2.15) 2.23 (2.14) 2.23 (2.15) 2.23 (2.14) 2.23 (2.14) 2.23 (2.15) 2.23 (2.15)
**PERF: 2.23 (2.11) 2.23 (2.12) 2.23 (2.11) 2.23 (2.12) 2.23 (2.11) 2.23 (2.11) 2.23 (2.12) 2.23 (2.12)
How can I improve the fps? I need at least 4~8 streams, but the current fps is far too low. I know that shrinking the input size (currently 1 * 3 * 608 * 608) would help, but I would like to know whether there are other ways to improve fps.
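For reference, these are the relevant batch-size settings from the deepstream-app config I am testing with for 8 streams (just a sketch of the two sections involved; the nvinfer config file name is from my project and other sections are omitted):

[streammux]
# one batch slot per input source
batch-size=8
batched-push-timeout=40000

[primary-gie]
# should match the batch size the engine was built for
batch-size=8
config-file=config_infer_primary_yoloV4.txt

My understanding is that the engine itself also has to be built for batch size 8 (not the batch-1 static engine above); otherwise nvinfer would still run batch-1 inference once per stream, which would match the 20/N behaviour I am seeing. Please correct me if that assumption is wrong.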