Frame rate maxes out while GPU utilization isn't at 100%

I was testing a project's performance on a system with a Titan RTX.
The frame rate drops to ~3 fps even though plenty of VRAM, CPU, and RAM is still available and the GPU isn't at 100% utilization.
What is the cause of that?
The model used is a converted YOLOv4.


• Hardware Platform: Titan RTX GPU
• DeepStream Version 5.0
• TensorRT Version 7.0
• NVIDIA GPU Driver Version: 460.91.03
• Issue Type ( question / bugs)
• How to reproduce the issue? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

Could you refer to GitHub - NVIDIA-AI-IOT/deepstream_tao_apps (sample apps that demonstrate how to deploy models trained with TAO on DeepStream) and use trtexec to profile the inference performance of the TRT engine that only reaches 3.8 fps?

Thanks!

Also, you can extract trtexec from the TensorRT tar package, which can be downloaded from https://developer.nvidia.com/nvidia-tensorrt-7x-download.

I used an explicit batch since it's a dynamic model.
Here's the result of running it.
Is this what's needed, and what does it indicate?

 root@e06e7092f0cf:/workspace/pytorch-YOLOv4# trtexec --explicitBatch  --workspace=15120 --fp16 --optShapes=input:3x3x608x608 --maxShapes=input:30x3x608x608 --minShapes=input:1x3x608x608 --shapes=input:30x3x608x608 --useSpinWait --loadEngine=yolov4-dynamic.engine 
&&&& RUNNING TensorRT.trtexec # trtexec --explicitBatch --workspace=15120 --fp16 --optShapes=input:3x3x608x608 --maxShapes=input:30x3x608x608 --minShapes=input:1x3x608x608 --shapes=input:30x3x608x608 --useSpinWait --loadEngine=yolov4-dynamic.engine
[10/18/2021-09:17:04] [I] === Model Options ===
[10/18/2021-09:17:04] [I] Format: *
[10/18/2021-09:17:04] [I] Model: 
[10/18/2021-09:17:04] [I] Output:
[10/18/2021-09:17:04] [I] === Build Options ===
[10/18/2021-09:17:04] [I] Max batch: explicit
[10/18/2021-09:17:04] [I] Workspace: 15120 MB
[10/18/2021-09:17:04] [I] minTiming: 1
[10/18/2021-09:17:04] [I] avgTiming: 8
[10/18/2021-09:17:04] [I] Precision: FP16
[10/18/2021-09:17:04] [I] Calibration: 
[10/18/2021-09:17:04] [I] Safe mode: Disabled
[10/18/2021-09:17:04] [I] Save engine: 
[10/18/2021-09:17:04] [I] Load engine: yolov4-dynamic.engine
[10/18/2021-09:17:04] [I] Inputs format: fp32:CHW
[10/18/2021-09:17:04] [I] Outputs format: fp32:CHW
[10/18/2021-09:17:04] [I] Input build shape: input=1x3x608x608+3x3x608x608+30x3x608x608
[10/18/2021-09:17:04] [I] === System Options ===
[10/18/2021-09:17:04] [I] Device: 0
[10/18/2021-09:17:04] [I] DLACore: 
[10/18/2021-09:17:04] [I] Plugins:
[10/18/2021-09:17:04] [I] === Inference Options ===
[10/18/2021-09:17:04] [I] Batch: Explicit
[10/18/2021-09:17:04] [I] Iterations: 10
[10/18/2021-09:17:04] [I] Duration: 3s (+ 200ms warm up)
[10/18/2021-09:17:04] [I] Sleep time: 0ms
[10/18/2021-09:17:04] [I] Streams: 1
[10/18/2021-09:17:04] [I] ExposeDMA: Disabled
[10/18/2021-09:17:04] [I] Spin-wait: Enabled
[10/18/2021-09:17:04] [I] Multithreading: Disabled
[10/18/2021-09:17:04] [I] CUDA Graph: Disabled
[10/18/2021-09:17:04] [I] Skip inference: Disabled
[10/18/2021-09:17:04] [I] Inputs:
[10/18/2021-09:17:04] [I] === Reporting Options ===
[10/18/2021-09:17:04] [I] Verbose: Disabled
[10/18/2021-09:17:04] [I] Averages: 10 inferences
[10/18/2021-09:17:04] [I] Percentile: 99
[10/18/2021-09:17:04] [I] Dump output: Disabled
[10/18/2021-09:17:04] [I] Profile: Disabled
[10/18/2021-09:17:04] [I] Export timing to JSON file: 
[10/18/2021-09:17:04] [I] Export output to JSON file: 
[10/18/2021-09:17:04] [I] Export profile to JSON file: 
[10/18/2021-09:17:04] [I] 
[10/18/2021-09:17:09] [I] Warmup completed 0 queries over 200 ms
[10/18/2021-09:17:09] [I] Timing trace has 0 queries over 3.51998 s
[10/18/2021-09:17:09] [I] Trace averages of 10 runs:
[10/18/2021-09:17:09] [I] Average on 10 runs - GPU latency: 144.241 ms - Host latency: 198.804 ms (end to end 296.445 ms)
[10/18/2021-09:17:09] [I] Average on 10 runs - GPU latency: 142.773 ms - Host latency: 196.592 ms (end to end 285.503 ms)
[10/18/2021-09:17:09] [I] Host latency
[10/18/2021-09:17:09] [I] min: 195.419 ms (end to end 283.243 ms)
[10/18/2021-09:17:09] [I] max: 220.26 ms (end to end 381.42 ms)
[10/18/2021-09:17:09] [I] mean: 197.52 ms (end to end 290.225 ms)
[10/18/2021-09:17:09] [I] median: 196.625 ms (end to end 285.56 ms)
[10/18/2021-09:17:09] [I] percentile: 220.26 ms at 99% (end to end 381.42 ms at 99%)
[10/18/2021-09:17:09] [I] throughput: 0 qps
[10/18/2021-09:17:09] [I] walltime: 3.51998 s
[10/18/2021-09:17:09] [I] GPU Compute
[10/18/2021-09:17:09] [I] min: 141.602 ms
[10/18/2021-09:17:09] [I] max: 159.055 ms
[10/18/2021-09:17:09] [I] mean: 143.377 ms
[10/18/2021-09:17:09] [I] median: 142.808 ms
[10/18/2021-09:17:09] [I] percentile: 159.055 ms at 99%
[10/18/2021-09:17:09] [I] total compute time: 3.29768 s
&&&& PASSED TensorRT.trtexec # trtexec --explicitBatch --workspace=15120 --fp16 --optShapes=input:3x3x608x608 --maxShapes=input:30x3x608x608 --minShapes=input:1x3x608x608 --shapes=input:30x3x608x608 --useSpinWait --loadEngine=yolov4-dynamic.engine
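
One way to read the log above: it reports `throughput: 0 qps`, so any throughput figure has to be derived by hand from the mean latencies. The sketch below does that arithmetic in Python. It assumes every inference processed the full batch of 30 requested with `--shapes=input:30x3x608x608`; the helper names `mean_latencies_ms` and `images_per_second` are hypothetical and are not part of TensorRT or DeepStream.

```python
import re

# A few lines copied from the trtexec output above.
sample_log = """
[10/18/2021-09:17:09] [I] Host latency
[10/18/2021-09:17:09] [I] min: 195.419 ms (end to end 283.243 ms)
[10/18/2021-09:17:09] [I] mean: 197.52 ms (end to end 290.225 ms)
[10/18/2021-09:17:09] [I] GPU Compute
[10/18/2021-09:17:09] [I] min: 141.602 ms
[10/18/2021-09:17:09] [I] mean: 143.377 ms
"""


def mean_latencies_ms(log_text):
    """Return {section name: mean latency in ms} for the two latency sections."""
    means = {}
    section = None
    for line in log_text.splitlines():
        # Drop the "[timestamp] [I] " prefix that trtexec puts on each line.
        msg = re.sub(r"^\[[^\]]*\]\s*\[\w\]\s*", "", line.strip()).strip()
        if msg in ("Host latency", "GPU Compute"):
            section = msg
        elif section and msg.startswith("mean:"):
            means[section] = float(re.match(r"mean:\s*([\d.]+)\s*ms", msg).group(1))
            section = None
    return means


def images_per_second(batch_size, latency_ms):
    """Images per second if one inference of `batch_size` images takes `latency_ms`."""
    return batch_size * 1000.0 / latency_ms


BATCH_SIZE = 30  # assumption: taken from --shapes=input:30x3x608x608

for name, latency in mean_latencies_ms(sample_log).items():
    print(f"{name}: {latency} ms per batch, "
          f"{images_per_second(BATCH_SIZE, latency):.1f} images/s")
```

Under that assumption the numbers work out to roughly 209 images/s for GPU compute and roughly 152 images/s for host latency. That is far above ~3 fps, which would suggest looking outside the engine itself (for example, at the rest of the pipeline) rather than at raw inference speed. Treat this as a rough reading of the log, not a diagnosis.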