YOLOv3 FPS on TensorRT

Hi,

This post: https://devblogs.nvidia.com/jetson-xavier-nx-the-worlds-smallest-ai-supercomputer/ reports inference close to 100 FPS for YOLOv3 (608x608) on AGX Xavier with TensorRT (Figure 3).

We followed the /usr/src/tensorrt/samples/python/yolov3_onnx sample and ran YOLOv3 (608x608) on AGX Xavier with TensorRT, but we only get 9 FPS, and with TensorRT INT8 only 32 FPS.

Why can't we reach this high FPS?

The environment is as follows:
OS: Ubuntu 18.04
TensorRT: 6.0.1

power mode:

nvpmodel -q
NV Fan Mode:quiet
NV Power Mode: MAXN

Please help.

Hi,

Have you maximized the device clocks?

sudo jetson_clocks

Thanks.

Hi,

I have tried maximizing the device clocks, but the FPS did not improve.

wow, @AastaLLL, this was not helpful…

We have the same problem here: only 9 FPS with YOLOv3 416x416, and more like 6 FPS with 608x608.

Hi,

I’m checking this issue with our internal team.
Will update more information later.

For benchmarking, it's recommended to use trtexec.

/usr/src/tensorrt/bin/trtexec --onnx=./yolov3.onnx --workspace=[MAX] --int8

Thanks.

Hi,

We got the feedback from our internal team.

Xavier uses the GPU plus 2 DLAs to achieve that throughput.
GPU latency is 26 ms and DLA latency is 32.5 ms.
Total throughput is 100 FPS.
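If the three engines really run fully in parallel, the arithmetic can be checked with a rough back-of-the-envelope model (my sketch, not an official benchmark: it assumes each engine independently contributes 1/latency frames per second):

```python
# Rough throughput model: one GPU engine plus two DLA engines running in
# parallel, each contributing 1/latency frames per second.
gpu_latency_s = 0.026   # 26 ms GPU latency, from the numbers above
dla_latency_s = 0.0325  # 32.5 ms latency per DLA core

total_fps = 1.0 / gpu_latency_s + 2.0 / dla_latency_s
print(f"{total_fps:.1f} FPS")  # approximately 100.0 FPS
```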

Thanks.

Thanks for the info. It seems the GPU alone should be able to do 30 FPS then. Can you point me to an example of how to run YOLO at that speed? Because, as I said, for us it is 6 FPS.

Thanks for the info, too.
In our implementation, YOLOv3 (COCO object detection, 608x608) costs 102 ms per image in Darknet (floating point), 110 ms in TensorRT (floating point), and 29.3 ms in TensorRT (INT8).

  1. It seems your claim "GPU latency is 26ms" is measured with TensorRT (INT8). Is that right?

  2. Total throughput is 100 fps.
    Is this parallel computing by the GPU and 2 DLAs?
    If yes, is the FPS estimated from the slowest of {GPU, DLA1, DLA2}, i.e. FPS = 1000/(32.5/3) ≈ 92 FPS (almost 100 FPS; presumably the gap comes from the lower GPU latency)?

  3. Can you also point me to an example/tutorial on how to run this parallel computing with the GPU and 2 DLAs?
    Thanks.

Sure.

1. Flash Xavier with JetPack 4.3 and maximize the performance with:

sudo nvpmodel -m 0
sudo jetson_clocks

2. Generate yolov3.onnx by following the README in /usr/src/tensorrt/samples/python/yolov3_onnx/.

3. Run yolov3.onnx with trtexec, which is designed for profiling.

/usr/src/tensorrt/bin/trtexec --onnx=./yolov3.onnx --workspace=26 --int8

Thanks.

Hi AastaLLL,

We tried running trtexec on the GPU; the command is as follows:

trtexec --onnx=yolov3_608.onnx --workspace=26 --int8

and the resulting information is:

&&&& RUNNING TensorRT.trtexec # trtexec --onnx=/home/nvidia/work/yolov3_onnx/yolov3_608.onnx --workspace=26 --int8
[00/10/2020-15:36:16] [I] === Model Options ===
[00/10/2020-15:36:16] [I] Format: ONNX
[00/10/2020-15:36:16] [I] Model: /home/nvidia/work/yolov3_onnx/yolov3_608.onnx
[00/10/2020-15:36:16] [I] Output:
[00/10/2020-15:36:16] [I] === Build Options ===
[00/10/2020-15:36:16] [I] Max batch: 1
[00/10/2020-15:36:16] [I] Workspace: 26 MB
[00/10/2020-15:36:16] [I] minTiming: 1
[00/10/2020-15:36:16] [I] avgTiming: 8
[00/10/2020-15:36:16] [I] Precision: INT8
[00/10/2020-15:36:16] [I] Calibration: Dynamic
[00/10/2020-15:36:16] [I] Safe mode: Disabled
[00/10/2020-15:36:16] [I] Save engine: 
[00/10/2020-15:36:16] [I] Load engine: 
[00/10/2020-15:36:16] [I] Inputs format: fp32:CHW
[00/10/2020-15:36:16] [I] Outputs format: fp32:CHW
[00/10/2020-15:36:16] [I] Input build shapes: model
[00/10/2020-15:36:16] [I] === System Options ===
[00/10/2020-15:36:16] [I] Device: 0
[00/10/2020-15:36:16] [I] DLACore: 
[00/10/2020-15:36:16] [I] Plugins:
[00/10/2020-15:36:16] [I] === Inference Options ===
[00/10/2020-15:36:16] [I] Batch: 1
[00/10/2020-15:36:16] [I] Iterations: 10 (200 ms warm up)
[00/10/2020-15:36:16] [I] Duration: 10s
[00/10/2020-15:36:16] [I] Sleep time: 0ms
[00/10/2020-15:36:16] [I] Streams: 1
[00/10/2020-15:36:16] [I] Spin-wait: Disabled
[00/10/2020-15:36:16] [I] Multithreading: Enabled
[00/10/2020-15:36:16] [I] CUDA Graph: Disabled
[00/10/2020-15:36:16] [I] Skip inference: Disabled
[00/10/2020-15:36:16] [I] Input inference shapes: model
[00/10/2020-15:36:16] [I] === Reporting Options ===
[00/10/2020-15:36:16] [I] Verbose: Disabled
[00/10/2020-15:36:16] [I] Averages: 10 inferences
[00/10/2020-15:36:16] [I] Percentile: 99
[00/10/2020-15:36:16] [I] Dump output: Disabled
[00/10/2020-15:36:16] [I] Profile: Disabled
[00/10/2020-15:36:16] [I] Export timing to JSON file: 
[00/10/2020-15:36:16] [I] Export profile to JSON file: 
[00/10/2020-15:36:16] [I] 
----------------------------------------------------------------
Input filename:   /home/nvidia/work/yolov3_onnx/yolov3_608.onnx
ONNX IR version:  0.0.4
Opset version:    9
Producer name:    NVIDIA TensorRT sample
Producer version: 
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
[00/10/2020-15:36:18] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[00/10/2020-15:36:18] [I] [TRT] 
[00/10/2020-15:36:18] [I] [TRT] --------------- Layers running on DLA: 
[00/10/2020-15:36:18] [I] [TRT] 
[00/10/2020-15:36:18] [I] [TRT] --------------- Layers running on GPU: 
[00/10/2020-15:36:18] [I] [TRT] (Unnamed Layer* 0) [Convolution], (Unnamed Layer* 2) [Activation], (Unnamed Layer* 3) [Convolution], (Unnamed Layer* 5) [Activation], (Unnamed Layer* 6) [Convolution], (Unnamed Layer* 8) [Activation], (Unnamed Layer* 9) [Convolution], (Unnamed Layer* 11) [Activation], (Unnamed Layer* 12) [ElementWise], (Unnamed Layer* 13) [Convolution], (Unnamed Layer* 15) [Activation], (Unnamed Layer* 16) [Convolution], (Unnamed Layer* 18) [Activation], (Unnamed Layer* 19) [Convolution], (Unnamed Layer* 21) [Activation], (Unnamed Layer* 22) [ElementWise], (Unnamed Layer* 23) [Convolution], (Unnamed Layer* 25) [Activation], (Unnamed Layer* 26) [Convolution], (Unnamed Layer* 28) [Activation], (Unnamed Layer* 29) [ElementWise], (Unnamed Layer* 30) [Convolution], (Unnamed Layer* 32) [Activation], (Unnamed Layer* 33) [Convolution], (Unnamed Layer* 35) [Activation], (Unnamed Layer* 36) [Convolution], (Unnamed Layer* 38) [Activation], (Unnamed Layer* 39) [ElementWise], (Unnamed Layer* 40) [Convolution], (Unnamed Layer* 42) [Activation], (Unnamed Layer* 43) [Convolution], (Unnamed Layer* 45) [Activation], (Unnamed Layer* 46) [ElementWise], (Unnamed Layer* 47) [Convolution], (Unnamed Layer* 49) [Activation], (Unnamed Layer* 50) [Convolution], (Unnamed Layer* 52) [Activation], (Unnamed Layer* 53) [ElementWise], (Unnamed Layer* 54) [Convolution], (Unnamed Layer* 56) [Activation], (Unnamed Layer* 57) [Convolution], (Unnamed Layer* 59) [Activation], (Unnamed Layer* 60) [ElementWise], (Unnamed Layer* 61) [Convolution], (Unnamed Layer* 63) [Activation], (Unnamed Layer* 64) [Convolution], (Unnamed Layer* 66) [Activation], (Unnamed Layer* 67) [ElementWise], (Unnamed Layer* 68) [Convolution], (Unnamed Layer* 70) [Activation], (Unnamed Layer* 71) [Convolution], (Unnamed Layer* 73) [Activation], (Unnamed Layer* 74) [ElementWise], (Unnamed Layer* 75) [Convolution], (Unnamed Layer* 77) [Activation], (Unnamed Layer* 78) [Convolution], (Unnamed Layer* 80) [Activation], 
(Unnamed Layer* 81) [ElementWise], (Unnamed Layer* 82) [Convolution], (Unnamed Layer* 84) [Activation], (Unnamed Layer* 85) [Convolution], (Unnamed Layer* 87) [Activation], (Unnamed Layer* 88) [ElementWise], (Unnamed Layer* 89) [Convolution], (Unnamed Layer* 91) [Activation], (Unnamed Layer* 92) [Convolution], (Unnamed Layer* 94) [Activation], (Unnamed Layer* 95) [Convolution], (Unnamed Layer* 97) [Activation], (Unnamed Layer* 98) [ElementWise], (Unnamed Layer* 99) [Convolution], (Unnamed Layer* 101) [Activation], (Unnamed Layer* 102) [Convolution], (Unnamed Layer* 104) [Activation], (Unnamed Layer* 105) [ElementWise], (Unnamed Layer* 106) [Convolution], (Unnamed Layer* 108) [Activation], (Unnamed Layer* 109) [Convolution], (Unnamed Layer* 111) [Activation], (Unnamed Layer* 112) [ElementWise], (Unnamed Layer* 113) [Convolution], (Unnamed Layer* 115) [Activation], (Unnamed Layer* 116) [Convolution], (Unnamed Layer* 118) [Activation], (Unnamed Layer* 119) [ElementWise], (Unnamed Layer* 120) [Convolution], (Unnamed Layer* 122) [Activation], (Unnamed Layer* 123) [Convolution], (Unnamed Layer* 125) [Activation], (Unnamed Layer* 126) [ElementWise], (Unnamed Layer* 127) [Convolution], (Unnamed Layer* 129) [Activation], (Unnamed Layer* 130) [Convolution], (Unnamed Layer* 132) [Activation], (Unnamed Layer* 133) [ElementWise], (Unnamed Layer* 134) [Convolution], (Unnamed Layer* 136) [Activation], (Unnamed Layer* 137) [Convolution], (Unnamed Layer* 139) [Activation], (Unnamed Layer* 140) [ElementWise], (Unnamed Layer* 141) [Convolution], (Unnamed Layer* 143) [Activation], (Unnamed Layer* 144) [Convolution], (Unnamed Layer* 146) [Activation], (Unnamed Layer* 147) [ElementWise], (Unnamed Layer* 148) [Convolution], (Unnamed Layer* 150) [Activation], (Unnamed Layer* 151) [Convolution], (Unnamed Layer* 153) [Activation], (Unnamed Layer* 154) [Convolution], (Unnamed Layer* 156) [Activation], (Unnamed Layer* 157) [ElementWise], (Unnamed Layer* 158) [Convolution], (Unnamed Layer* 
160) [Activation], (Unnamed Layer* 161) [Convolution], (Unnamed Layer* 163) [Activation], (Unnamed Layer* 164) [ElementWise], (Unnamed Layer* 165) [Convolution], (Unnamed Layer* 167) [Activation], (Unnamed Layer* 168) [Convolution], (Unnamed Layer* 170) [Activation], (Unnamed Layer* 171) [ElementWise], (Unnamed Layer* 172) [Convolution], (Unnamed Layer* 174) [Activation], (Unnamed Layer* 175) [Convolution], (Unnamed Layer* 177) [Activation], (Unnamed Layer* 178) [ElementWise], (Unnamed Layer* 179) [Convolution], (Unnamed Layer* 181) [Activation], (Unnamed Layer* 182) [Convolution], (Unnamed Layer* 184) [Activation], (Unnamed Layer* 185) [Convolution], (Unnamed Layer* 187) [Activation], (Unnamed Layer* 188) [Convolution], (Unnamed Layer* 190) [Activation], (Unnamed Layer* 191) [Convolution], (Unnamed Layer* 193) [Activation], (Unnamed Layer* 194) [Convolution], (Unnamed Layer* 196) [Activation], (Unnamed Layer* 197) [Convolution], (Unnamed Layer* 198) [Convolution], (Unnamed Layer* 200) [Activation], (Unnamed Layer* 201) [Resize], 086_upsample copy, (Unnamed Layer* 203) [Convolution], (Unnamed Layer* 205) [Activation], (Unnamed Layer* 206) [Convolution], (Unnamed Layer* 208) [Activation], (Unnamed Layer* 209) [Convolution], (Unnamed Layer* 211) [Activation], (Unnamed Layer* 212) [Convolution], (Unnamed Layer* 214) [Activation], (Unnamed Layer* 215) [Convolution], (Unnamed Layer* 217) [Activation], (Unnamed Layer* 218) [Convolution], (Unnamed Layer* 220) [Activation], (Unnamed Layer* 221) [Convolution], (Unnamed Layer* 222) [Convolution], (Unnamed Layer* 224) [Activation], (Unnamed Layer* 225) [Resize], 098_upsample copy, (Unnamed Layer* 227) [Convolution], (Unnamed Layer* 229) [Activation], (Unnamed Layer* 230) [Convolution], (Unnamed Layer* 232) [Activation], (Unnamed Layer* 233) [Convolution], (Unnamed Layer* 235) [Activation], (Unnamed Layer* 236) [Convolution], (Unnamed Layer* 238) [Activation], (Unnamed Layer* 239) [Convolution], (Unnamed Layer* 241) 
[Activation], (Unnamed Layer* 242) [Convolution], (Unnamed Layer* 244) [Activation], (Unnamed Layer* 245) [Convolution], 
[00/10/2020-15:36:21] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[00/10/2020-15:39:46] [I] [TRT] Detected 1 inputs and 3 output network tensors.
[00/10/2020-15:39:47] [I] Average over 10 runs is 25.7592 ms (host walltime is 25.8521 ms, 99% percentile time is 26.4878).
[00/10/2020-15:39:48] [I] Average over 10 runs is 26.8782 ms (host walltime is 27.0092 ms, 99% percentile time is 33.6098).
[00/10/2020-15:39:48] [I] Average over 10 runs is 26.711 ms (host walltime is 26.8539 ms, 99% percentile time is 33.9163).
[00/10/2020-15:39:48] [I] Average over 10 runs is 26.6445 ms (host walltime is 26.7936 ms, 99% percentile time is 32.8204).
[00/10/2020-15:39:48] [I] Average over 10 runs is 26.6838 ms (host walltime is 26.8311 ms, 99% percentile time is 33.1121).
[00/10/2020-15:39:49] [I] Average over 10 runs is 26.7551 ms (host walltime is 26.8875 ms, 99% percentile time is 33.6251).
[00/10/2020-15:39:49] [I] Average over 10 runs is 26.9908 ms (host walltime is 27.1109 ms, 99% percentile time is 32.6678).
[00/10/2020-15:39:49] [I] Average over 10 runs is 26.7439 ms (host walltime is 26.8641 ms, 99% percentile time is 33.1622).
[00/10/2020-15:39:49] [I] Average over 10 runs is 26.9623 ms (host walltime is 27.0811 ms, 99% percentile time is 33.2608).
[00/10/2020-15:39:50] [I] Average over 10 runs is 26.7927 ms (host walltime is 26.8805 ms, 99% percentile time is 33.184).
&&&& PASSED TensorRT.trtexec # trtexec --onnx=/home/nvidia/work/yolov3_onnx/yolov3_608.onnx --workspace=26 --int8

And we tried running trtexec on the DLA; the command is as follows:

trtexec --onnx=/home/nvidia/work/yolov3_onnx/yolov3_608.onnx --workspace=26 --useDLACore=1 --int8
&&&& RUNNING TensorRT.trtexec # trtexec --onnx=/home/nvidia/work/yolov3_onnx/yolov3_608.onnx --workspace=26 --useDLACore=1 --int8
[00/10/2020-16:04:24] [I] === Model Options ===
[00/10/2020-16:04:24] [I] Format: ONNX
[00/10/2020-16:04:24] [I] Model: /home/nvidia/work/yolov3_onnx/yolov3_608.onnx
[00/10/2020-16:04:24] [I] Output:
[00/10/2020-16:04:24] [I] === Build Options ===
[00/10/2020-16:04:24] [I] Max batch: 1
[00/10/2020-16:04:24] [I] Workspace: 26 MB
[00/10/2020-16:04:24] [I] minTiming: 1
[00/10/2020-16:04:24] [I] avgTiming: 8
[00/10/2020-16:04:24] [I] Precision: INT8
[00/10/2020-16:04:24] [I] Calibration: Dynamic
[00/10/2020-16:04:24] [I] Safe mode: Disabled
[00/10/2020-16:04:24] [I] Save engine: 
[00/10/2020-16:04:24] [I] Load engine: 
[00/10/2020-16:04:24] [I] Inputs format: fp32:CHW
[00/10/2020-16:04:24] [I] Outputs format: fp32:CHW
[00/10/2020-16:04:24] [I] Input build shapes: model
[00/10/2020-16:04:24] [I] === System Options ===
[00/10/2020-16:04:24] [I] Device: 0
[00/10/2020-16:04:24] [I] DLACore: 1
[00/10/2020-16:04:24] [I] Plugins:
[00/10/2020-16:04:24] [I] === Inference Options ===
[00/10/2020-16:04:24] [I] Batch: 1
[00/10/2020-16:04:24] [I] Iterations: 10 (200 ms warm up)
[00/10/2020-16:04:24] [I] Duration: 10s
[00/10/2020-16:04:24] [I] Sleep time: 0ms
[00/10/2020-16:04:24] [I] Streams: 1
[00/10/2020-16:04:24] [I] Spin-wait: Disabled
[00/10/2020-16:04:24] [I] Multithreading: Enabled
[00/10/2020-16:04:24] [I] CUDA Graph: Disabled
[00/10/2020-16:04:24] [I] Skip inference: Disabled
[00/10/2020-16:04:24] [I] Input inference shapes: model
[00/10/2020-16:04:24] [I] === Reporting Options ===
[00/10/2020-16:04:24] [I] Verbose: Disabled
[00/10/2020-16:04:24] [I] Averages: 10 inferences
[00/10/2020-16:04:24] [I] Percentile: 99
[00/10/2020-16:04:24] [I] Dump output: Disabled
[00/10/2020-16:04:24] [I] Profile: Disabled
[00/10/2020-16:04:24] [I] Export timing to JSON file: 
[00/10/2020-16:04:24] [I] Export profile to JSON file: 
[00/10/2020-16:04:24] [I] 
----------------------------------------------------------------
Input filename:   /home/nvidia/work/yolov3_onnx/yolov3_608.onnx
ONNX IR version:  0.0.4
Opset version:    9
Producer name:    NVIDIA TensorRT sample
Producer version: 
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
WARNING: ONNX model has a newer ir_version (0.0.4) than this parser was built against (0.0.3).
[00/10/2020-16:04:26] [E] [TRT] (Unnamed Layer* 2) [Activation]: ActivationLayer (with ActivationType = LEAKY_RELU) not supported for DLA.
[00/10/2020-16:04:26] [E] [TRT] Default DLA is enabled but layer (Unnamed Layer* 2) [Activation] is not supported on DLA and falling back to GPU is not enabled.
[00/10/2020-16:04:26] [E] Engine could not be created
&&&& FAILED TensorRT.trtexec # trtexec --onnx=/home/nvidia/work/yolov3_onnx/yolov3_608.onnx --workspace=26 --useDLACore=1 --int8

Some layers, such as LEAKY_RELU, are not supported on the DLA, but this document reports that LEAKY_RELU is supported: https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html

How can we solve this issue?

Hi,

The page you shared is the TensorRT support matrix.
The DLA support matrix is here:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#dla_layers

Activation Layer
    Functions supported: ReLU, Sigmoid, Hyperbolic Tangent
    Negative slope not supported for ReLU
    Only ReLU operation is supported in INT8

To solve this, try enabling --allowGPUFallback when executing, for example:

/usr/src/tensorrt/bin/trtexec --onnx=./yolov3.onnx --workspace=26 --useDLACore=1 --int8 --allowGPUFallback

[Activation] is not supported on DLA and falling back to GPU is not enabled.

Thanks.

Thanks for the info.

I ran trtexec with --allowGPUFallback enabled, and it reports 33 ms.

About your last reply:

Xavier is using GPU+2DLA to achieve the throughput.
GPU latency is 26ms and DLA latency is 32.5ms.
Total throughput is 100fps

So, is the 32.5 ms for DLA+GPU (not the DLA alone)? If so, how can Xavier reach 100 FPS?

I am not sure if my idea is correct, because I think it should be:

DLA0 + GPU = 32.5ms
DLA1 + GPU = 32.5ms
1 / (32.5ms) * 2 = 61.5fps

Can the GPU still be used at the same time?

Hi,

You are missing one more instance that runs purely on the GPU.
Try this command at the same time:

/usr/src/tensorrt/bin/trtexec --onnx=./yolov3.onnx --workspace=26 --int8

Thanks.

Sorry I'm late.

I have tried this command, but it only executes on the GPU, without the DLAs.

I hope to execute the two DLAs and one GPU simultaneously, as you described, for a speed of 100 FPS.

Do you have an example of executing two DLAs and one GPU simultaneously?

Hi,

A TensorRT engine can only be deployed to a single target.
So you will need to run them as separate processes to reach 100 FPS throughput.

In short, run the following commands in different consoles simultaneously.

/usr/src/tensorrt/bin/trtexec --onnx=./yolov3.onnx --workspace=26 --int8
/usr/src/tensorrt/bin/trtexec --onnx=./yolov3.onnx --workspace=26 --useDLACore=0 --int8
/usr/src/tensorrt/bin/trtexec --onnx=./yolov3.onnx --workspace=26 --useDLACore=1 --int8

However, please note that this benchmark targets throughput.
There is no synchronization mechanism among the processes.
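Opening three consoles by hand can also be scripted. A minimal Python sketch (the launcher is hypothetical helper code, not part of TensorRT; the trtexec invocations are the ones above):

```python
import subprocess

def run_concurrently(cmds):
    """Launch each command as its own OS process and wait for all of them.

    This mirrors running the commands in separate consoles: the processes
    share the machine but are not synchronized with each other.
    """
    procs = [subprocess.Popen(cmd) for cmd in cmds]
    return [p.wait() for p in procs]

# For the benchmark above, the three engines would be launched like:
# run_concurrently([
#     ["/usr/src/tensorrt/bin/trtexec", "--onnx=./yolov3.onnx", "--workspace=26", "--int8"],
#     ["/usr/src/tensorrt/bin/trtexec", "--onnx=./yolov3.onnx", "--workspace=26", "--useDLACore=0", "--int8"],
#     ["/usr/src/tensorrt/bin/trtexec", "--onnx=./yolov3.onnx", "--workspace=26", "--useDLACore=1", "--int8"],
# ])
```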

Thanks.

Sorry I’m late,

We have tried the commands in different consoles simultaneously, as in the following figure:

If we run only the GPU case, FPS = 35, but when we run GPU -> DLA0 -> DLA1 concurrently, the speed slows down.

The FPS curve is shown in the following figure:

We also tried the original trtexec execution and hit the same problem.

We still can't get 100 FPS when using the two DLAs and the GPU.

Where is the problem?

Up

Hi,

Sorry for the late update. Some information was missing from this topic.

Please note that there are some threading/clock issues with trtexec in TensorRT 6.0.
You will need to enable the spin-wait flag to get better performance.

/usr/src/tensorrt/bin/trtexec --onnx=./yolov3.onnx --workspace=26 --int8 --useSpinWait

For benchmarking, please also maximize the device performance first:

sudo nvpmodel -m 0
sudo jetson_clocks

You can also experiment with the workspace size to get the best throughput on YOLOv3.

Thanks.

Hi AastaLLL and everyone,

We have tried the above command parameters; the result is shown in the figure.

It looks like the same FPS issue.

We have created a new topic at https://devtalk.nvidia.com/default/topic/1072834/jetson-agx-xavier/how-to-use-gpu-2-dla-can-be-100fps-for-yolov3-on-xavier/ and moved the issue there; please follow the new topic.

Thank you.