Deep Learning Inference Benchmarking Instructions

I haven’t tried converting ssd_mobilenet_v3_large before, so I’m not sure what changes would be required to config.py, but you could use it or https://github.com/AastaNV/TRT_object_detection/tree/master/config as a starting point. If you are having trouble, you may want to post a new topic about ssd_mobilenet_v3_large in the TensorRT or Nano forum.

Hi,

We tried a combination of YOLO (608x608) and the TensorFlow framework for object detection. We are only getting around 5 FPS for detection, which is very low compared to the published inference performance.

Are we missing something?
What steps should we follow to improve the performance of this combination?

Hi cpamritkar, you could get improved performance by using TensorRT. I recommend checking out the YOLOv3 sample found at: /usr/src/tensorrt/samples/python/yolov3_onnx
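Roughly, that sample is driven by two scripts (this is from memory - check the sample’s README for the exact prerequisites and Python version, as the ONNX conversion script has required Python 2 on some TensorRT releases):

cd /usr/src/tensorrt/samples/python/yolov3_onnx
python yolov3_to_onnx.py # downloads the Darknet cfg/weights and converts them to ONNX
python onnx_to_tensorrt.py # builds a TensorRT engine from the ONNX model and runs sample inference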

The YOLO benchmarking app from this post also uses TensorRT - the benchmarks are for Tiny YOLOv3, but it appears that other YOLO models could also be run from that code.

I tried to reproduce these experiments by following the above instructions, but the first 3 I tried failed with errors. I suspect I have not installed the proper prerequisites, but I’m not sure. I’ll copy and paste the errors I received for each of the 3 failed experiments:

==============================================================
[SSD-Mobilenet-V2]

gavin@jetson-nano:/usr/src/tensorrt/samples/sampleUffSSD_rect$ sudo make
[sudo] password for gavin:
../Makefile.config:7: CUDA_INSTALL_DIR variable is not specified, using /usr/local/cuda by default, use CUDA_INSTALL_DIR=<cuda_directory> to change.
../Makefile.config:10: CUDNN_INSTALL_DIR variable is not specified, using $CUDA_INSTALL_DIR by default, use CUDNN_INSTALL_DIR=<cudnn_directory> to change.
:
Compiling: sampleUffSSD.cpp
sampleUffSSD.cpp: In function ‘void populateTFInputData(float*)’:
sampleUffSSD.cpp:92:5: error: ‘string’ was not declared in this scope
string line;
^~~~~~
sampleUffSSD.cpp:92:5: note: suggested alternatives:
In file included from /usr/include/c++/7/iosfwd:39:0,
from /usr/include/c++/7/ios:38,
from /usr/include/c++/7/ostream:38,
from /usr/include/c++/7/iostream:39,
from sampleUffSSD.cpp:5:
/usr/include/c++/7/bits/stringfwd.h:74:33: note: ‘std::__cxx11::string’
typedef basic_string<char> string;
^~~~~~
/usr/include/c++/7/bits/stringfwd.h:74:33: note: ‘std::__cxx11::string’
sampleUffSSD.cpp:94:31: error: ‘line’ was not declared in this scope
while (getline(labelFile, line))
^~~~
sampleUffSSD.cpp:94:31: note: suggested alternative: ‘clone’
while (getline(labelFile, line))
^~~~
clone
sampleUffSSD.cpp:96:9: error: ‘istringstream’ was not declared in this scope
istringstream iss(line);
^~~~~~~~~~~~~
sampleUffSSD.cpp:96:9: note: suggested alternative:
In file included from /usr/include/c++/7/ios:38:0,
from /usr/include/c++/7/ostream:38,
from /usr/include/c++/7/iostream:39,
from sampleUffSSD.cpp:5:
/usr/include/c++/7/iosfwd:150:38: note: ‘std::istringstream’
typedef basic_istringstream<char> istringstream;
^~~~~~~~~~~~~
sampleUffSSD.cpp:98:9: error: ‘iss’ was not declared in this scope
iss >> num;
^~~
sampleUffSSD.cpp: In function ‘void populateClassLabels(std::__cxx11::string (&)[37])’:
sampleUffSSD.cpp:111:5: error: ‘string’ was not declared in this scope
string line;
^~~~~~
sampleUffSSD.cpp:111:5: note: suggested alternatives:
In file included from /usr/include/c++/7/iosfwd:39:0,
from /usr/include/c++/7/ios:38,
from /usr/include/c++/7/ostream:38,
from /usr/include/c++/7/iostream:39,
from sampleUffSSD.cpp:5:
/usr/include/c++/7/bits/stringfwd.h:74:33: note: ‘std::__cxx11::string’
typedef basic_string<char> string;
^~~~~~
/usr/include/c++/7/bits/stringfwd.h:74:33: note: ‘std::__cxx11::string’
sampleUffSSD.cpp:113:31: error: ‘line’ was not declared in this scope
while (getline(labelFile, line))
^~~~
sampleUffSSD.cpp:113:31: note: suggested alternative: ‘clone’
while (getline(labelFile, line))
^~~~
clone
sampleUffSSD.cpp: In function ‘nvinfer1::ICudaEngine* loadModelAndCreateEngine(const char*, int, nvuffparser::IUffParser*, nvinfer1::IHostMemory*&)’:
sampleUffSSD.cpp:170:34: error: unable to find numeric literal operator ‘operator""_MB’
builder->setMaxWorkspaceSize(128_MB); // We need about 1GB of scratch space for the plugin layer for batch size 5.
^~~~~~
sampleUffSSD.cpp:170:34: note: use -std=gnu++11 or -fext-numeric-literals to enable more built-in suffixes
sampleUffSSD.cpp: In function ‘int main(int, char**)’:
sampleUffSSD.cpp:602:5: error: ‘vector’ was not declared in this scope
vector<float> data(N * INPUT_C * INPUT_H * INPUT_W);
^~~~~~
sampleUffSSD.cpp:602:5: note: suggested alternative:
In file included from /usr/include/c++/7/vector:64:0,
from sampleUffSSD.cpp:10:
/usr/include/c++/7/bits/stl_vector.h:216:11: note: ‘std::vector’
class vector : protected _Vector_base<_Tp, _Alloc>
^~~~~~
sampleUffSSD.cpp:602:12: error: expected primary-expression before ‘float’
vector<float> data(N * INPUT_C * INPUT_H * INPUT_W);
^~~~~
sampleUffSSD.cpp:620:9: error: ‘data’ was not declared in this scope
data[i * volImg + 0 * volChl + j] = float(ppms[i].buffer[j * INPUT_C + 0]) - 123.68;
^~~~
sampleUffSSD.cpp:620:9: note: suggested alternative: ‘atan’
data[i * volImg + 0 * volChl + j] = float(ppms[i].buffer[j * INPUT_C + 0]) - 123.68;
^~~~
atan
sampleUffSSD.cpp:625:36: error: ‘data’ was not declared in this scope
std::cout << " Data Size " << data.size() << std::endl;
^~~~
sampleUffSSD.cpp:625:36: note: suggested alternative: ‘atan’
std::cout << " Data Size " << data.size() << std::endl;
^~~~
atan
sampleUffSSD.cpp:639:12: error: expected primary-expression before ‘float’
vector<float> detectionOut(100000); //(N * detectionOutputParam.keepTopK * 7);
^~~~~
sampleUffSSD.cpp:640:12: error: expected primary-expression before ‘int’
vector<int> keepCount(N);
^~~
sampleUffSSD.cpp:643:38: error: ‘detectionOut’ was not declared in this scope
doInference(*context, &data[0], &detectionOut[0], &keepCount[0], N);
^~~~~~~~~~~~
sampleUffSSD.cpp:643:38: note: suggested alternative: ‘detectionOutputParam’
doInference(*context, &data[0], &detectionOut[0], &keepCount[0], N);
^~~~~~~~~~~~
detectionOutputParam
sampleUffSSD.cpp:643:56: error: ‘keepCount’ was not declared in this scope
doInference(*context, &data[0], &detectionOut[0], &keepCount[0], N);
^~~~~~~~~
sampleUffSSD.cpp:644:5: error: ‘cout’ was not declared in this scope
cout << " KeepCount " << keepCount[0] << “\n”;
^~~~
sampleUffSSD.cpp:644:5: note: suggested alternative:
In file included from sampleUffSSD.cpp:5:0:
/usr/include/c++/7/iostream:61:18: note: ‘std::cout’
extern ostream cout; /// Linked to standard output
^~~~
../Makefile.config:173: recipe for target ‘../../bin/dchobj/sampleUffSSD.o’ failed
make: *** [../../bin/dchobj/sampleUffSSD.o] Error 1

==============================================================
[Tiny YOLO v3]

gavin@jetson-nano:/usr/src/tensorrt/bin$ sudo ./trtexec --uff=~/output_graph.uff --uffInput=input_1,1,512,512 --output=conv2d_19/Sigmoid --fp16
[sudo] password for gavin:
&&&& RUNNING TensorRT.trtexec # ./trtexec --uff=~/output_graph.uff --uffInput=input_1,1,512,512 --output=conv2d_19/Sigmoid --fp16
[00/19/2020-12:57:25] [I] === Model Options ===
[00/19/2020-12:57:25] [I] Format: UFF
[00/19/2020-12:57:25] [I] Model: ~/output_graph.uff
[00/19/2020-12:57:25] [I] Uff Inputs Layout: NCHW
[00/19/2020-12:57:25] [I] Input: input_1,1,512,512
[00/19/2020-12:57:25] [I] Output: conv2d_19/Sigmoid
[00/19/2020-12:57:25] [I] === Build Options ===
[00/19/2020-12:57:25] [I] Max batch: 1
[00/19/2020-12:57:25] [I] Workspace: 16 MB
[00/19/2020-12:57:25] [I] minTiming: 1
[00/19/2020-12:57:25] [I] avgTiming: 8
[00/19/2020-12:57:25] [I] Precision: FP16
[00/19/2020-12:57:25] [I] Calibration:
[00/19/2020-12:57:25] [I] Safe mode: Disabled
[00/19/2020-12:57:25] [I] Save engine:
[00/19/2020-12:57:25] [I] Load engine:
[00/19/2020-12:57:25] [I] Inputs format: fp32:CHW
[00/19/2020-12:57:25] [I] Outputs format: fp32:CHW
[00/19/2020-12:57:25] [I] Input build shapes: model
[00/19/2020-12:57:25] [I] === System Options ===
[00/19/2020-12:57:25] [I] Device: 0
[00/19/2020-12:57:25] [I] DLACore:
[00/19/2020-12:57:25] [I] Plugins:
[00/19/2020-12:57:25] [I] === Inference Options ===
[00/19/2020-12:57:25] [I] Batch: 1
[00/19/2020-12:57:25] [I] Iterations: 10 (200 ms warm up)
[00/19/2020-12:57:25] [I] Duration: 10s
[00/19/2020-12:57:25] [I] Sleep time: 0ms
[00/19/2020-12:57:25] [I] Streams: 1
[00/19/2020-12:57:25] [I] Spin-wait: Disabled
[00/19/2020-12:57:25] [I] Multithreading: Enabled
[00/19/2020-12:57:25] [I] CUDA Graph: Disabled
[00/19/2020-12:57:25] [I] Skip inference: Disabled
[00/19/2020-12:57:25] [I] Input inference shapes: model
[00/19/2020-12:57:25] [I] === Reporting Options ===
[00/19/2020-12:57:25] [I] Verbose: Disabled
[00/19/2020-12:57:25] [I] Averages: 10 inferences
[00/19/2020-12:57:25] [I] Percentile: 99
[00/19/2020-12:57:25] [I] Dump output: Disabled
[00/19/2020-12:57:25] [I] Profile: Disabled
[00/19/2020-12:57:25] [I] Export timing to JSON file:
[00/19/2020-12:57:25] [I] Export profile to JSON file:
[00/19/2020-12:57:25] [I]
[00/19/2020-12:57:28] [E] [TRT] UffParser: Unsupported number of graph 0
[00/19/2020-12:57:28] [E] Failed to parse uff file
[00/19/2020-12:57:28] [E] Parsing model failed
[00/19/2020-12:57:28] [E] Engine could not be created
&&&& FAILED TensorRT.trtexec # ./trtexec --uff=~/output_graph.uff --uffInput=input_1,1,512,512 --output=conv2d_19/Sigmoid --fp16

==============================================================
[U-Net Segmentation]

gavin@jetson-nano:/usr/src/tensorrt/bin$ sudo ./trtexec --output=Mconv7_stage2_L2 --deploy=../data/googlenet/pose_estimation.prototxt --fp16 --batch=1
[sudo] password for gavin:
&&&& RUNNING TensorRT.trtexec # ./trtexec --output=Mconv7_stage2_L2 --deploy=../data/googlenet/pose_estimation.prototxt --fp16 --batch=1
[00/19/2020-12:58:36] [I] === Model Options ===
[00/19/2020-12:58:36] [I] Format: Caffe
[00/19/2020-12:58:36] [I] Model:
[00/19/2020-12:58:36] [I] Prototxt: ../data/googlenet/pose_estimation.prototxt
[00/19/2020-12:58:36] [I] Output: Mconv7_stage2_L2
[00/19/2020-12:58:36] [I] === Build Options ===
[00/19/2020-12:58:36] [I] Max batch: 1
[00/19/2020-12:58:36] [I] Workspace: 16 MB
[00/19/2020-12:58:36] [I] minTiming: 1
[00/19/2020-12:58:36] [I] avgTiming: 8
[00/19/2020-12:58:36] [I] Precision: FP16
[00/19/2020-12:58:36] [I] Calibration:
[00/19/2020-12:58:36] [I] Safe mode: Disabled
[00/19/2020-12:58:36] [I] Save engine:
[00/19/2020-12:58:36] [I] Load engine:
[00/19/2020-12:58:36] [I] Inputs format: fp32:CHW
[00/19/2020-12:58:36] [I] Outputs format: fp32:CHW
[00/19/2020-12:58:36] [I] Input build shapes: model
[00/19/2020-12:58:36] [I] === System Options ===
[00/19/2020-12:58:36] [I] Device: 0
[00/19/2020-12:58:36] [I] DLACore:
[00/19/2020-12:58:36] [I] Plugins:
[00/19/2020-12:58:36] [I] === Inference Options ===
[00/19/2020-12:58:36] [I] Batch: 1
[00/19/2020-12:58:36] [I] Iterations: 10 (200 ms warm up)
[00/19/2020-12:58:36] [I] Duration: 10s
[00/19/2020-12:58:36] [I] Sleep time: 0ms
[00/19/2020-12:58:36] [I] Streams: 1
[00/19/2020-12:58:36] [I] Spin-wait: Disabled
[00/19/2020-12:58:36] [I] Multithreading: Enabled
[00/19/2020-12:58:36] [I] CUDA Graph: Disabled
[00/19/2020-12:58:36] [I] Skip inference: Disabled
[00/19/2020-12:58:36] [I] Input inference shapes: model
[00/19/2020-12:58:36] [I] === Reporting Options ===
[00/19/2020-12:58:36] [I] Verbose: Disabled
[00/19/2020-12:58:36] [I] Averages: 10 inferences
[00/19/2020-12:58:36] [I] Percentile: 99
[00/19/2020-12:58:36] [I] Dump output: Disabled
[00/19/2020-12:58:36] [I] Profile: Disabled
[00/19/2020-12:58:36] [I] Export timing to JSON file:
[00/19/2020-12:58:36] [I] Export profile to JSON file:
[00/19/2020-12:58:36] [I]
[00/19/2020-12:58:37] [E] [TRT] CaffeParser: Could not open file ../data/googlenet/pose_estimation.prototxt
[00/19/2020-12:58:37] [E] [TRT] CaffeParser: Could not parse deploy file
[00/19/2020-12:58:37] [E] Failed to parse caffe model or prototxt, tensors blob not found
[00/19/2020-12:58:37] [E] Parsing model failed
[00/19/2020-12:58:37] [E] Engine could not be created
&&&& FAILED TensorRT.trtexec # ./trtexec --output=Mconv7_stage2_L2 --deploy=../data/googlenet/pose_estimation.prototxt --fp16 --batch=1

Why does Tiny YOLOv3 take ~500 ms for a single image, while for a bigger batch it drops to almost 50 ms per image?

For the SSD-Mobilenet-v2 benchmark, please apply this patch to sampleUffSSD_rect/sampleUffSSD.cpp and recompile:

19a20
> using namespace std;
21c22
< static Logger gLogger;
---
> /*static*/ Logger gLogger;
169c170
<     builder->setMaxWorkspaceSize(128_MB); // We need about 1GB of scratch space for the plugin layer for batch size 5.
---
>     builder->setMaxWorkspaceSize(1024 * 1024 * 128); // We need about 1GB of scratch space for the plugin layer for batch size 5.
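To apply it, you could save the diff above to a file (for example fix.patch - the filename is arbitrary) and then:

cd /usr/src/tensorrt/samples/sampleUffSSD_rect
sudo patch sampleUffSSD.cpp < fix.patch
sudo make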

I was able to run the U-Net and Pose Estimation benchmarks without errors - perhaps try re-downloading the files, and make sure that your paths and command line options are correct?

Why does Tiny YOLOv3 take ~500 ms for a single image, while for a bigger batch it drops to almost 50 ms per image?

Our published results are all for batch size 1 (FP16). I just tried the Tiny-YOLOv3 benchmark again from the first page of this post with JetPack 4.3, and got 33 FPS for Tiny-YOLOv3 on the Nano. It seems the updated TensorRT version in JetPack 4.3 (TensorRT 6) gave a bit better performance than before.

Sorry, I understood it incorrectly before: if I give it a single image, the timing is ~500 ms, but if I give it multiple images with batch size 1, the timing drops a lot. I am not doing anything special; I am simply running the provided code.

Can I run video files using trt-yolo-app?

Try running the jetson_clocks script before you run a single image with trt-yolo-app. Otherwise, the first image can often take longer to process, because the processor clock frequencies need some time to react to the processing load and ramp up to their maximums. It’s recommended to run multiple images through the benchmark (even though they are all batch size 1) to get a more accurate average processing time.
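For example (the nvpmodel step is optional and assumes the 10W MAXN mode, which is mode 0 on Nano):

sudo nvpmodel -m 0 # select the 10W MAXN power mode, if not already active
sudo jetson_clocks # lock the CPU/GPU/EMC clocks at their maximum frequencies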

I believe the trt-yolo-app code is currently only set up to run on individual image files from the test_images.txt list. It does use OpenCV for image loading, so perhaps you could modify it to load video with OpenCV instead. Initially it might be easiest to dump your video to image files (i.e. using ffmpeg) and generate a test_images.txt file with the image filenames. Or if you look at the DeepStream 4 SDK, it supports YOLO natively, and that can play video files.
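For the ffmpeg route, something along these lines should produce the needed inputs (input_video.mp4 and the frames directory are placeholders):

mkdir frames
ffmpeg -i input_video.mp4 frames/frame_%05d.jpg
ls -d $PWD/frames/*.jpg > test_images.txt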

Hi @dusty_nv, I’m currently trying to choose the best solution for real-time object detection/recognition on the Jetson Nano.

The Jetson Nano benchmarks led me here.

There’s something I don’t understand: in the benchmark it says YOLOv3-tiny was tested with the Darknet framework, but here you use DeepStream. Is that normal?

Hi @olivier.berton.leclercq, the Framework column of the table indicates which framework the model was trained in, e.g. Darknet for YOLO. All of the models were run with TensorRT, and in the case of Tiny-YOLOv3, it was using a sample from DeepStream that uses TensorRT underneath.

Hi @dusty_nv,
OK, but I tried Darknet on the Jetson Nano, and with YOLOv3-tiny I get a maximum of 6-7 FPS. Does that mean my Darknet is not installed correctly? (Some problem connecting to TensorRT, or something else, for example?)

Running the Darknet code itself doesn’t use TensorRT - you have to use the TensorRT/DeepStream benchmarking application linked to in the instructions. Note that since these benchmarks were originally published, there has been a YOLOv3 sample integrated into TensorRT (if you have JetPack 4.3, see /usr/src/tensorrt/samples/python/yolov3_onnx)

Use a file such as ‘Inception_v4.prototxt’ for image classification.

I think there are no weights in this prototxt file.
Is this a pre-trained model?
If not, how does it work?

I’d appreciate your answer.

Hi @byyft2, the trtexec benchmarking program can run a Caffe network prototxt without actually needing the weights - when no caffemodel is supplied, it fills the network with random weights, which is fine for measuring performance.
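For example, assuming the prototxt’s final output blob is named prob (substitute the actual top name from your prototxt file, and the real path to it):

cd /usr/src/tensorrt/bin
./trtexec --deploy=<path_to>/Inception_v4.prototxt --output=prob --fp16 --batch=1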

Hi,
I am using TensorRT 7.1.0 + CUDA 10.2 to build Tiny YOLO v3. There is a problem:
nvinfer1::plugin has no member ‘createPReLUPlugin’, which is used in plugin_factory.cpp, line 43.
There is no problem with TensorRT 6.0 + CUDA 10.0.
How do I modify the code if I use TensorRT 7.1.0 + CUDA 10.2?
Thanks.
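One possible approach (an unverified suggestion - createPReLUPlugin was removed from the legacy plugin API in TensorRT 7): the PReLU layers in the YOLO network are leaky-ReLU activations, which newer TensorRT versions support natively, so the plugin call in the network-building code can usually be replaced with an IActivationLayer. In this sketch, network is the nvinfer1::INetworkDefinition being built and input is the tensor that was being passed to the plugin; the 0.1f slope is an assumption that should be checked against the value the plugin was created with:

// Unverified sketch (requires NvInfer.h): replace the removed
// createPReLUPlugin(...) call with TensorRT's built-in leaky-ReLU activation.
nvinfer1::IActivationLayer* leaky
    = network->addActivation(*input, nvinfer1::ActivationType::kLEAKY_RELU);
assert(leaky != nullptr);  // layer creation can fail
leaky->setAlpha(0.1f);     // negative slope; assumed to match the plugin's parameter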