Hi dusty_nv,
Could you supply an OpenPose demo that takes a 'bmp', 'png', or 'jpg' file as input, and finally shows the pose result or saves the resulting image to disk?
Thank you very much!!!
In the instructions for Tiny Yolo v3 where it says:
“In the file yolov3-tiny.txt, search for "--precision=kINT8" and replace "kINT8" with "kHALF" to change the inference precision to FP16.”
it should also say that this line needs to be uncommented.
I have been trying to convert and run my custom 6-class TensorFlow ssd_inception_v2 model on the Jetson Nano for a while, but I can't come close to your benchmark performance. I based my work on the TensorRT Python uff_ssd sample project and got 19 fps.
Then I tried the UFF model from your SSD-Mobilenet-V2 project and got 26 fps, which is still quite poor compared to your 39 fps result. Where does this performance difference come from?
When I looked at your sampleUffSSD_rect source code, I saw that you only time context.execute(); memory operations are not included. In a real application this is not representative: you must preprocess every frame and memcpy it to the CUDA device. When I applied your timing functions to the doInference() function instead, I got 28 fps, which is quite similar to my Python app's result :)
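To illustrate the gap, the same engine yields very different fps numbers depending on what the timer wraps. The ~25.6 ms network time below corresponds to the 39 fps figure; the per-frame preprocessing/memcpy overhead is a hypothetical value chosen only to show the effect:

```python
def fps(seconds_per_frame):
    """Frames per second for a given per-frame latency."""
    return 1.0 / seconds_per_frame

# Network-only time (roughly what wrapping context.execute() measures).
execute_s = 0.0256    # ~25.6 ms per frame -> ~39 fps

# Hypothetical extra per-frame cost for preprocessing + host/device memcpy
# (illustrative, not a measured value).
overhead_s = 0.0101   # ~10 ms

print(round(fps(execute_s)))               # network-only: 39
print(round(fps(execute_s + overhead_s)))  # end-to-end:   28
```

The point is that the benchmark and the end-to-end application are measuring different spans of the pipeline, so the fps numbers are not directly comparable.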
Hi fuatka, the benchmark was from a C++ application using the converted UFF. As noted in the blog benchmarks, the timing results measure only the network, to provide an apples-to-apples comparison between networks, since pre-processing may vary with the network, platform, and application requirements. Also, if you are using CUDA mapped zero-copy memory or CUDA managed memory, the extra memory copies aren't required.
@fuatka Have you run the jetson_clocks script? I feel like I got better results than 19 fps on that example, but you are correct that the code leaves out the memcpy. As @dusty_nv said, you can use zero-copy, but it's not implemented in the sample. Remember that RAM and GPU memory are shared on Jetson devices, which can save a lot of copy cost.
In your sampleUffSSD_rect project, you convert the UFF file to a TensorRT engine with the loadModelAndCreateEngine function, which is faster than the uff_ssd Python sample application.
Is it possible to save this TensorRT engine to a file and then load it back from the file?
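One common pattern is to serialize the built engine to a plan file once and reuse it on later runs, which skips the slow UFF-to-engine conversion. Here is a sketch: load_or_build_engine is a hypothetical helper, and the serialize()/deserialize_cuda_engine() calls mentioned in the comments are from the TensorRT Python API (the C++ equivalents are ICudaEngine::serialize() and IRuntime::deserializeCudaEngine()):

```python
import os

def load_or_build_engine(path, build_fn):
    """Return serialized engine bytes, building and caching them if needed.

    build_fn() should return the engine as bytes -- e.g. the result of
    engine.serialize() after UFF parsing, the step that
    loadModelAndCreateEngine performs in the C++ sample.  On later runs
    the cached plan file is read back instead of rebuilding the engine.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    data = build_fn()
    with open(path, "wb") as f:
        f.write(data)
    return data

# With TensorRT, the cached bytes would then be turned back into an engine:
#   runtime = trt.Runtime(TRT_LOGGER)
#   engine = runtime.deserialize_cuda_engine(data)
```

Note that a serialized plan is specific to the TensorRT version and GPU it was built on, so it cannot be shared across JetPack upgrades or different devices.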
I tried the mobilenet_v2 SSD benchmark and it reproduced the reference run times. However, when I tried to evaluate it, the network output contained only noise (low confidence per detection, nothing useful). When I tried it with my own UFF (built following the instructions in the sampleUffSSD example), it produced the right results, but the performance decreased by 40%. Could you give us a brief description of the sample_unpruned_mobilenet_v2.uff file in the ssd-mobilenet-v2 archive? How can I reproduce it? If I follow the instructions in the sampleUffSSD example, the performance is worse.
Hi broothy, we are preparing a benchmarking suite for release that interprets the results of the model correctly. The current sample measures only the inference time of the network. For now, you could reference this GitHub repo for getting the output of the SSD-Mobilenet-v2 model:
I have tried many networks and taken a lot of measurements over the past few days.
I checked the repo you mentioned (https://github.com/AastaNV/TRT_object_detection). I was able to run the Python source (inference time was ~47 ms). I extracted the UFF file and integrated it into my C++ project; it produced the same inference time, and the detections were fine.
Every SSD Mobilenet v2 with meaningful results produced ~48 ms inference time (21 fps). Only the detector included in the benchmark suite produced the ~27 ms run time, but its output was only noise. I tried every possible input combination (NCHW, NHWC, 0…1, -1…1, -128…128, 0…255), but the output was still a mess. The output-parsing part of the referred code (https://github.com/AastaNV/TRT_object_detection) is the same as what I saw in the benchmark code.
Are you 100% sure that sample_unpruned_mobilenet_v2.uff contains a valid SSD Mobilenet v2 detector? Every SSD Mobilenet v2 detector I tested had ~50% longer inference time, and I was not able to validate the results of yours. Could you validate it, or provide a walkthrough of how to reproduce the .uff file, or a proper way to parse the output?
About the measurements: I ran each code in runlevel 3 without X, after nvpmodel -m 0 and jetson_clocks.
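For reference, one of the input layouts tried above (NCHW, pixels scaled to -1…1, which many TensorFlow SSD-Mobilenet pipelines use) can be sketched in plain Python; whether it matches this particular .uff is exactly what is in question here:

```python
def preprocess(image, height, width):
    """Convert an HWC uint8 image (nested lists, RGB) to a flat CHW
    float list with pixel values scaled from 0..255 to -1..1."""
    chw = []
    for c in range(3):                # channel-major (NCHW) order
        for y in range(height):
            for x in range(width):
                chw.append(image[y][x][c] / 127.5 - 1.0)
    return chw

# 1x2 RGB image: one black pixel, one white pixel.
img = [[[0, 0, 0], [255, 255, 255]]]
print(preprocess(img, 1, 2))  # [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
```

In a real pipeline this would be done with NumPy (transpose plus vectorized arithmetic) before copying the buffer to the device; the loop form above just makes the layout explicit.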
@broothy I noticed the same thing and realized that the provided model was NOT trained on the COCO dataset. Going through the code, it uses 37 classes, whereas COCO uses 91. The only dataset I could find with 37 classes was a dog-breed dataset. Sure enough, feeding in the proper labels and dog images produced correct results. So if you want to detect anything other than dogs, you'll need to use one of the other models. As to why it is faster, my guess is that since there are fewer classes, it requires fewer convolutional layers. Hope that helps!
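On the parsing side, the NMS plugin used by these UFF SSD models emits 7 floats per detection -- [image_id, label, confidence, xmin, ymin, xmax, ymax], with box coordinates normalized to 0…1 -- which is the layout the referred repo decodes. A minimal sketch of decoding that flat buffer:

```python
DET_SIZE = 7  # [image_id, label, confidence, xmin, ymin, xmax, ymax]

def parse_detections(output, threshold=0.5):
    """Split the flat NMS output into per-detection dicts above threshold."""
    detections = []
    for i in range(0, len(output), DET_SIZE):
        image_id, label, conf, x1, y1, x2, y2 = output[i:i + DET_SIZE]
        if conf >= threshold:
            detections.append({
                "label": int(label),       # index into the label file
                "confidence": conf,
                "box": (x1, y1, x2, y2),   # normalized 0..1 coordinates
            })
    return detections

# Two raw detections; only the first passes the 0.5 threshold.
raw = [0, 18, 0.92, 0.10, 0.20, 0.60, 0.80,
       0,  3, 0.07, 0.00, 0.00, 0.10, 0.10]
print(parse_detections(raw))
```

With the wrong label file (COCO's 91 classes against this model's 37), the class indices decoded here would simply map to the wrong names, which matches the symptom described above.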
I’m attempting to run the SSD-Mobilenet-V2 test. However, when I attempt to make sampleUffSSD_rect, I receive the following:
Compiling: sampleUffSSD.cpp
sampleUffSSD.cpp:21:15: error: ‘gLogger’ was declared ‘extern’ and later ‘static’ [-fpermissive]
static Logger gLogger;
^~~~~~~
In file included from ../common/common.h:55:0,
from BatchStreamPPM.h:9,
from sampleUffSSD.cpp:12:
../common/logger.h:55:15: note: previous declaration of ‘gLogger’
extern Logger gLogger;
Hi Mark, it looks like there was an update in JetPack 4.2.1 to the sample utils of TensorRT - we’ll have to take a look at updating it. In the meantime, you might want to try commenting out the declaration of the gLogger object in sampleUffSSD.cpp.