Deep Learning Inference Benchmarking Instructions

I found that I could get the MobileNet v2 SSD example to detect something by making the following changes to sampleUffSSD.cpp:

  • reduce visualizeThreshold from 0.4 to 0.2
  • uncomment the code near the end of main() that outputs the detections
  • replace the labels with indices. I did this by replacing the code in populateClassLabels() with:
for (int i = 0; i < OUTPUT_CLS_SIZE; ++i)
{
    CLASSES[i] = std::to_string(i);
}

Also make sure the sample can find the dog.ppm image on your system (on my system dog.ppm was installed in /usr/src/tensorrt/data/ssd).

After these changes the network was able to detect both the dogs in the picture.

Hi dusty_nv,
Could you supply an OpenPose demo that takes a ‘bmp’, ‘png’, or ‘jpg’ file as input, and finally shows the pose result or saves it as a picture on disk?
Thank you very much!

In the instructions for Tiny-YOLOv3, where it says:
“In the file yolov3-tiny.txt, search for “--precision=kINT8” and replace “kINT8” with “kHALF” to change the inference precision to FP16.”
it should also say that this line needs to be uncommented.
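In other words, the edit ends up looking roughly like this (a sketch from memory - I'm assuming the flag file uses ‘#’ to comment lines out):

# before: the precision flag is commented out
#--precision=kINT8

# after: uncommented and switched to FP16
--precision=kHALF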

Thanks SB_97, I just updated the instructions for Tiny-YOLOv3 with this step.

Sorry, but I have to ask: after running YOLO, how do I test the accuracy, in addition to the speed?

Hi dusty_nv,

I have been trying to convert and run my custom 6-class TensorFlow ssd_inception_v2 model on the Jetson Nano for a while, but I can’t come close to your benchmark performance. I based my work on the TensorRT Python uff_ssd sample project and got 19 fps.

Then I tried the UFF model from your SSD-Mobilenet-V2 project and got 26 fps, which is still quite poor compared to your 39 fps result. Where does this performance difference come from?

When I looked at your sampleUffSSD_rect source code, I saw that you only time context.execute(); memory operations are not included. In a real application this is not realistic, since you must preprocess every frame and memcpy it to the CUDA device. When I placed your timing functions around the doInference() function instead, I got 28 fps, which is quite similar to my Python app result :)
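To be concrete, this is roughly the kind of timing I mean - a minimal sketch where doInference() is the sample’s own function (which includes the host/device copies); the timing helper itself is just std::chrono, written here for illustration:

#include <chrono>
#include <cstdio>

// Time one call of an arbitrary callable and return milliseconds.
template <typename F>
double timedMs(F&& runOnce)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    runOnce();
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Usage inside the sample's loop (argument names follow sampleUffSSD):
//   double ms = timedMs([&] {
//       doInference(*context, inputData.data(), detectionOut.data(),
//                   keepCount.data(), batchSize);
//   });
//   printf("per-frame: %.2f ms (%.1f fps)\n", ms, 1000.0 / ms);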

Please correct me if I am wrong…

Hi fuatka, the benchmark was from a C++ application using the converted UFF. As noted in the blog, the benchmark timings measure the network only, to provide an apples-to-apples comparison between networks, since pre-processing varies with the network, platform, and application requirements. Also, if you are using CUDA mapped zero-copy memory or CUDA managed memory, the extra memory copies aren’t required.

@fuatka Have you run the jetson_clocks script? I feel like I had better results than 19fps on that example, but you are correct that that code leaves out the memcpy. As @dusty_nv said, you can use zero-copy, but it’s not implemented in the sample. Remember that RAM and GPU memory are physically the same on Jetson devices, which can save a lot of copy cost.
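If anyone wants to try it, here is a minimal sketch of what a zero-copy allocation looks like (the buffer size and how you hook it into the TensorRT bindings are illustrative; only the CUDA runtime calls themselves are the real API):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Must be called before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t bytes = 3 * 300 * 300 * sizeof(float); // e.g. one SSD input

    // Allocate mapped, pinned host memory; on Jetson this is the same DRAM
    // the GPU uses, so no per-frame cudaMemcpy is needed.
    void* hostPtr = nullptr;
    if (cudaHostAlloc(&hostPtr, bytes, cudaHostAllocMapped) != cudaSuccess)
    {
        printf("cudaHostAlloc failed\n");
        return 1;
    }

    // Device-visible alias of the same buffer; pass this as the input
    // binding instead of a separately allocated device buffer.
    void* devicePtr = nullptr;
    cudaHostGetDevicePointer(&devicePtr, hostPtr, 0);

    // ... pre-process the frame directly into hostPtr, run inference ...

    cudaFreeHost(hostPtr);
    return 0;
}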

Hi atyshka,

I am getting 19 fps with my custom ssd_inception_v2 model; the ssd_mobilenet_v2 model runs faster, at about 26 fps.

The jetson_clocks script generally gives about a 10% performance improvement.
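For reference, these are the commands I run before measuring (standard JetPack commands; both need sudo):

sudo nvpmodel -m 0    # select the maximum-performance power mode
sudo jetson_clocks    # lock CPU/GPU/EMC clocks at their maximums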

Hi dusty_nv,

In your sampleUffSSD_rect project, you convert the UFF file to a TensorRT engine with the loadModelAndCreateEngine function, which is faster than the uff_ssd Python sample application.

Is it possible to save this TensorRT engine to a file and then load it from the file?

Hi fuatka, see this documentation about serializing and deserializing engine files from C++ and Python:
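The C++ side boils down to something like this - a rough sketch against the TensorRT 5.x API shipped with JetPack 4.2 (gLogger is the sample’s logger, and the file path is up to you):

#include "NvInfer.h"
#include <fstream>
#include <iterator>
#include <vector>

// Serialize a built engine to disk so the UFF parse/build only runs once.
void saveEngine(nvinfer1::ICudaEngine& engine, const char* path)
{
    nvinfer1::IHostMemory* blob = engine.serialize();
    std::ofstream out(path, std::ios::binary);
    out.write(static_cast<const char*>(blob->data()), blob->size());
    blob->destroy();
}

// Read the serialized engine back and deserialize it with a runtime.
nvinfer1::ICudaEngine* loadEngine(nvinfer1::ILogger& logger, const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    // The third argument is an optional plugin factory; nullptr is fine when
    // the required plugins are registered through the plugin registry.
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}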

Hi,

I tried the mobilenet_v2 SSD benchmark and it produced the reference run times. However, when I tried to evaluate the detections, the network output contained only noise (low confidence per detection, nothing useful). When I tried it with my own UFF (built per the instructions in the sampleUffSSD example), it produced the right results, but performance decreased by 40%. Could you give us a brief description of the sample_unpruned_mobilenet_v2.uff file in the ssd-mobilenet-v2 archive? How can I reproduce it? If I follow the instructions in the sampleUffSSD example, the performance is worse.

Hi broothy, we are preparing a benchmarking suite for release that interprets the results of the model correctly. The current sample measures only the inference time of the network. For now, you could reference this GitHub repo for getting the output of the SSD-Mobilenet-v2 model:

https://github.com/AastaNV/TRT_object_detection

Hi dusty_nv,

I tried many networks and measured a lot in the past few days.

I’ve checked the repo you mentioned (https://github.com/AastaNV/TRT_object_detection). I was able to run the Python source (inference time was ~47 ms). I extracted the UFF file and integrated it into my C++ project; it produced the same inference time, and the detections were fine.

I’ve checked out SSD Mobilenet v2 from your repo (the releases page of dusty-nv/jetson-inference on GitHub); the inference time was ~48 ms and the detections were OK.

Every SSD Mobilenet v2 that produced meaningful results had a ~48 ms inference time (21 fps). Only the detector included in the benchmark suite produced the ~27 ms run time, but its output was only noise. I’ve tried every input combination I could think of (NCHW, NHWC, 0…1, -1…1, -128…128, 0…255), but the output was still a mess. The output-parsing part of the referenced code (https://github.com/AastaNV/TRT_object_detection) is the same as what I saw in the benchmark code.

Are you 100% sure that sample_unpruned_mobilenet_v2.uff contains a valid SSD Mobilenet v2 detector? Every SSD Mobilenet v2 detector I tested had a ~50% longer inference time, and I was not able to validate the results of yours. Could you validate it, or provide a walkthrough of how I can reproduce the .uff file, or a proper way to parse the output?
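For completeness, this is how I’m interpreting the detection output, following the NMS-plugin convention from sampleUffSSD (7 floats per detection); if sample_unpruned_mobilenet_v2.uff lays out its output differently, that alone would explain the noise I’m seeing. The variable names (detectionOut, keepTopK, visualizeThreshold) follow the sample:

#include <cstdio>

// detectionOut: host copy of the detection output binding,
// keepTopK detections of 7 floats each:
// [image_id, label, confidence, xmin, ymin, xmax, ymax] (normalized coords)
const float* det = detectionOut;
for (int i = 0; i < keepTopK; ++i, det += 7)
{
    const float confidence = det[2];
    if (confidence < visualizeThreshold)
        continue;

    const int   label = static_cast<int>(det[1]);
    const float xmin = det[3], ymin = det[4], xmax = det[5], ymax = det[6];
    printf("class %d  conf %.2f  box [%.2f %.2f %.2f %.2f]\n",
           label, confidence, xmin, ymin, xmax, ymax);
}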

About the measurements: I ran everything in runlevel 3 (without X), after nvpmodel -m 0 and jetson_clocks.

@broothy I noticed the same thing and realized that the model provided was NOT trained on the COCO dataset. Going through the code, it uses 37 classes, whereas COCO uses 91. The only dataset I could find with 37 classes was a dog-breed dataset. Sure enough, feeding in the proper labels and dog images produced correct results. Therefore, if you want to detect anything other than dogs, you’ll need to use one of the other models. As to why it is faster, my guess is that with fewer classes the class-prediction layers are smaller. Hope that helps!

Hi

I’m attempting to run the SSD-Mobilenet-V2 test. However, when I attempt to make sampleUffSSD_rect, I receive the following:

Compiling: sampleUffSSD.cpp
sampleUffSSD.cpp:21:15: error: ‘gLogger’ was declared ‘extern’ and later ‘static’ [-fpermissive]
 static Logger gLogger;
               ^~~~~~~
In file included from ../common/common.h:55:0,
                 from BatchStreamPPM.h:9,
                 from sampleUffSSD.cpp:12:
../common/logger.h:55:15: note: previous declaration of ‘gLogger’
 extern Logger gLogger;

Any ideas?

Many thanks

Hi Mark, it looks like there was an update in JetPack 4.2.1 to the sample utilities of TensorRT - we’ll have to take a look at updating it. In the meantime, you might want to try commenting out the declaration of the gLogger object in sampleUffSSD.cpp.

Thanks very much. Commenting out didn’t help, but removing static did.
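For anyone hitting the same build error, the change that worked for me was simply:

// sampleUffSSD.cpp - the definition must not be static, because
// common/logger.h already declares it as 'extern Logger gLogger;'
Logger gLogger;   // was: static Logger gLogger;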

I get the expected ~26ms timing for inference.

Is there an explanation as to why this is approximately twice as fast as the example at https://github.com/AastaNV/TRT_object_detection? Is it just because of the python overhead?

Hi Mark, that model was trained on 37 classes whereas the COCO models are trained on 91 classes. See this post for more info:

Deep Learning Inference Benchmarking Instructions - Jetson Nano - NVIDIA Developer Forums

I did
git clone https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps

but I did not get YOLO:

dlinano@jetson-nano:~/git/deepstream_reference_apps$ tree -d
.
├── CaffeMNIST
│   ├── data
│   └── nvdsinfer_custom_impl_CaffeMNIST
├── anomaly
│   ├── apps
│   │   └── deepstream-anomaly-detection
│   ├── config
│   └── plugins
│       ├── gst-dsdirection
│       │   └── dsdirection_lib
│       └── gst-dsopticalflow
│           └── dsopticalflow_lib
└── senet
    ├── Revised_Scripts
    ├── apps
    │   ├── deepstream-senet
    │   └── trt-senet
    ├── config
    ├── data
    └── lib

Regards, Markus