Frame rate maxes out while GPU utilization isn't at max

mai.algendy · October 12, 2021, 12:15pm

I was testing a project's performance on a system with a Titan RTX.
The frame rate drops to ~3 fps even though plenty of VRAM, CPU, and RAM is still available and the GPU isn’t at 100% utilization.
What is the cause of that?
The model used is a converted YOLOv4.

• Hardware Platform: Titan RTX GPU
• DeepStream Version 5.0
• TensorRT Version 7.0
• NVIDIA GPU Driver Version: 460.91.03
• Issue Type ( question / bugs)
• How to reproduce the issue? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

mchi · October 13, 2021, 1:31am

Could you refer to GitHub - NVIDIA-AI-IOT/deepstream_tao_apps (sample apps that demonstrate how to deploy models trained with TAO on DeepStream) and use trtexec to profile the inference performance of the TRT engine that only reaches 3.8 fps?

Thanks!

mchi · October 13, 2021, 1:36am

Also, you can extract trtexec from the TensorRT tar package, which can be downloaded from https://developer.nvidia.com/nvidia-tensorrt-7x-download.

mai.algendy · October 18, 2021, 7:20am

I used an explicit batch since it’s a dynamic model.
Here’s the result of running it.
Is this what’s needed, and what does it indicate?

 root@e06e7092f0cf:/workspace/pytorch-YOLOv4# trtexec --explicitBatch  --workspace=15120 --fp16 --optShapes=input:3x3x608x608 --maxShapes=input:30x3x608x608 --minShapes=input:1x3x608x608 --shapes=input:30x3x608x608 --useSpinWait --loadEngine=yolov4-dynamic.engine 
&&&& RUNNING TensorRT.trtexec # trtexec --explicitBatch --workspace=15120 --fp16 --optShapes=input:3x3x608x608 --maxShapes=input:30x3x608x608 --minShapes=input:1x3x608x608 --shapes=input:30x3x608x608 --useSpinWait --loadEngine=yolov4-dynamic.engine
[10/18/2021-09:17:04] [I] === Model Options ===
[10/18/2021-09:17:04] [I] Format: *
[10/18/2021-09:17:04] [I] Model: 
[10/18/2021-09:17:04] [I] Output:
[10/18/2021-09:17:04] [I] === Build Options ===
[10/18/2021-09:17:04] [I] Max batch: explicit
[10/18/2021-09:17:04] [I] Workspace: 15120 MB
[10/18/2021-09:17:04] [I] minTiming: 1
[10/18/2021-09:17:04] [I] avgTiming: 8
[10/18/2021-09:17:04] [I] Precision: FP16
[10/18/2021-09:17:04] [I] Calibration: 
[10/18/2021-09:17:04] [I] Safe mode: Disabled
[10/18/2021-09:17:04] [I] Save engine: 
[10/18/2021-09:17:04] [I] Load engine: yolov4-dynamic.engine
[10/18/2021-09:17:04] [I] Inputs format: fp32:CHW
[10/18/2021-09:17:04] [I] Outputs format: fp32:CHW
[10/18/2021-09:17:04] [I] Input build shape: input=1x3x608x608+3x3x608x608+30x3x608x608
[10/18/2021-09:17:04] [I] === System Options ===
[10/18/2021-09:17:04] [I] Device: 0
[10/18/2021-09:17:04] [I] DLACore: 
[10/18/2021-09:17:04] [I] Plugins:
[10/18/2021-09:17:04] [I] === Inference Options ===
[10/18/2021-09:17:04] [I] Batch: Explicit
[10/18/2021-09:17:04] [I] Iterations: 10
[10/18/2021-09:17:04] [I] Duration: 3s (+ 200ms warm up)
[10/18/2021-09:17:04] [I] Sleep time: 0ms
[10/18/2021-09:17:04] [I] Streams: 1
[10/18/2021-09:17:04] [I] ExposeDMA: Disabled
[10/18/2021-09:17:04] [I] Spin-wait: Enabled
[10/18/2021-09:17:04] [I] Multithreading: Disabled
[10/18/2021-09:17:04] [I] CUDA Graph: Disabled
[10/18/2021-09:17:04] [I] Skip inference: Disabled
[10/18/2021-09:17:04] [I] Inputs:
[10/18/2021-09:17:04] [I] === Reporting Options ===
[10/18/2021-09:17:04] [I] Verbose: Disabled
[10/18/2021-09:17:04] [I] Averages: 10 inferences
[10/18/2021-09:17:04] [I] Percentile: 99
[10/18/2021-09:17:04] [I] Dump output: Disabled
[10/18/2021-09:17:04] [I] Profile: Disabled
[10/18/2021-09:17:04] [I] Export timing to JSON file: 
[10/18/2021-09:17:04] [I] Export output to JSON file: 
[10/18/2021-09:17:04] [I] Export profile to JSON file: 
[10/18/2021-09:17:04] [I] 
[10/18/2021-09:17:09] [I] Warmup completed 0 queries over 200 ms
[10/18/2021-09:17:09] [I] Timing trace has 0 queries over 3.51998 s
[10/18/2021-09:17:09] [I] Trace averages of 10 runs:
[10/18/2021-09:17:09] [I] Average on 10 runs - GPU latency: 144.241 ms - Host latency: 198.804 ms (end to end 296.445 ms)
[10/18/2021-09:17:09] [I] Average on 10 runs - GPU latency: 142.773 ms - Host latency: 196.592 ms (end to end 285.503 ms)
[10/18/2021-09:17:09] [I] Host latency
[10/18/2021-09:17:09] [I] min: 195.419 ms (end to end 283.243 ms)
[10/18/2021-09:17:09] [I] max: 220.26 ms (end to end 381.42 ms)
[10/18/2021-09:17:09] [I] mean: 197.52 ms (end to end 290.225 ms)
[10/18/2021-09:17:09] [I] median: 196.625 ms (end to end 285.56 ms)
[10/18/2021-09:17:09] [I] percentile: 220.26 ms at 99% (end to end 381.42 ms at 99%)
[10/18/2021-09:17:09] [I] throughput: 0 qps
[10/18/2021-09:17:09] [I] walltime: 3.51998 s
[10/18/2021-09:17:09] [I] GPU Compute
[10/18/2021-09:17:09] [I] min: 141.602 ms
[10/18/2021-09:17:09] [I] max: 159.055 ms
[10/18/2021-09:17:09] [I] mean: 143.377 ms
[10/18/2021-09:17:09] [I] median: 142.808 ms
[10/18/2021-09:17:09] [I] percentile: 159.055 ms at 99%
[10/18/2021-09:17:09] [I] total compute time: 3.29768 s
&&&& PASSED TensorRT.trtexec # trtexec --explicitBatch --workspace=15120 --fp16 --optShapes=input:3x3x608x608 --maxShapes=input:30x3x608x608 --minShapes=input:1x3x608x608 --shapes=input:30x3x608x608 --useSpinWait --loadEngine=yolov4-dynamic.engine

mchi · October 21, 2021, 3:57am

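From the GPU Compute line in the log, “[10/18/2021-09:17:09] [I] mean: 143.377 ms”, the inference throughput of the engine is roughly “(1000 / mean_ms) * batch_size” = (1000 / 143.377) * 30 ≈ 209 fps. So the TRT engine on its own is far faster than the ~3 fps seen in the application.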
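
For anyone who wants to repeat this arithmetic on their own trtexec output, here is a minimal Python sketch. It assumes the log has the same layout as the one posted above (a "GPU Compute" header followed by a "mean: ... ms" line). The function names `gpu_compute_mean_ms` and `throughput_fps` are made up for this example and are not part of TensorRT or DeepStream.

```python
import re

# Excerpt copied from the trtexec log posted earlier in this thread.
SAMPLE_LOG = """\
[10/18/2021-09:17:09] [I] Host latency
[10/18/2021-09:17:09] [I] mean: 197.52 ms (end to end 290.225 ms)
[10/18/2021-09:17:09] [I] GPU Compute
[10/18/2021-09:17:09] [I] min: 141.602 ms
[10/18/2021-09:17:09] [I] max: 159.055 ms
[10/18/2021-09:17:09] [I] mean: 143.377 ms
"""


def gpu_compute_mean_ms(log_text: str) -> float:
    """Return the first 'mean' value that follows the 'GPU Compute' header.

    Assumes the log layout shown above; other trtexec versions may
    print their summary differently.
    """
    in_gpu_section = False
    for line in log_text.splitlines():
        if "GPU Compute" in line:
            in_gpu_section = True
        elif in_gpu_section:
            match = re.search(r"mean:\s*([0-9.]+)\s*ms", line)
            if match:
                return float(match.group(1))
    raise ValueError("no 'GPU Compute' mean found in log")


def throughput_fps(latency_ms: float, batch_size: int) -> float:
    """Frames per second when one batch of `batch_size` frames takes `latency_ms`."""
    if latency_ms <= 0 or batch_size <= 0:
        raise ValueError("latency_ms and batch_size must be positive")
    return 1000.0 / latency_ms * batch_size


if __name__ == "__main__":
    mean_ms = gpu_compute_mean_ms(SAMPLE_LOG)
    # Batch size 30 comes from --shapes=input:30x3x608x608 in the command above.
    print(f"GPU compute mean: {mean_ms} ms")
    print(f"Estimated inference throughput: {throughput_fps(mean_ms, 30):.1f} fps")
```

This only estimates what the engine can do in isolation. If the application reports a much lower number, the gap has to come from somewhere else in the pipeline.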

mchi · October 21, 2021, 3:59am

How do you check the fps? What’s the pipeline? Could you refer to DeepStream SDK FAQ - #10 by mchi to dump the pipeline graph?

system · November 9, 2021, 1:34am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.