Hi,
I have a Jetson Nano with JetPack 4.3.
I followed your YOLOv3 example and installed the onnx2trt package for TensorRT 6.
I compiled everything and ran YOLO, and it seems to work at ~194 milliseconds per inference (inference only, not including pre- or post-processing).
The issues are:
1. When I change the onnx_to_tensorrt.py script to create the engine with FP16 (see the sketch at the end of this post), I get 400 milliseconds per inference.
2. When I change yolov3.cfg to an input smaller than 608, for example 416, I get 604 milliseconds per inference.
Is there something I am missing? Was this example optimized for a specific input size and FP32?
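For reference, here is roughly what my FP16 change looks like (a minimal sketch against the TensorRT 6 Python API; the helper name and workspace size are just illustrative, not the exact sample code):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_fp16_engine(onnx_file_path):
    """Parse an ONNX model and build a TensorRT engine with FP16 enabled."""
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network() as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = 1 << 28  # 256 MiB; adjust for what the Nano can spare
        builder.max_batch_size = 1
        # FP16 only helps if the GPU has fast FP16 support (the Nano does).
        if builder.platform_has_fast_fp16:
            builder.fp16_mode = True
        with open(onnx_file_path, 'rb') as model:
            if not parser.parse(model.read()):
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                return None
        return builder.build_cuda_engine(network)
```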
Hi,
Do you mean running the network with a batch of 5?
20 FPS sounds good, but I haven't managed to get inference below 250 milliseconds with a 416 input and FP16.
Can you explain what I should change in your example?
So far I have done the following:
1. Enabled FP16 when creating the engine.
2. Changed the yolov3.cfg “width” + “height” from 608 to 416.
3. Changed the output shapes to support the new input (see the sketch after this list).
What else do I need to do to reach that FPS?
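For step 3, the point is that YOLOv3's three output grids scale with the input size (input/32, input/16 and input/8), so the shapes used to reshape the flat TensorRT outputs have to change accordingly. A sketch of what I mean (the `output_shapes` name follows the sample; the rest is illustrative):

```python
# YOLOv3 output grids are input/32, input/16 and input/8,
# with 255 = 3 anchors * (80 classes + 5) channels per grid cell.
input_size = 416  # was 608 in the original cfg

output_shapes = [
    (1, 255, input_size // 32, input_size // 32),  # 13x13 for 416 (19x19 for 608)
    (1, 255, input_size // 16, input_size // 16),  # 26x26 for 416 (38x38 for 608)
    (1, 255, input_size // 8,  input_size // 8),   # 52x52 for 416 (76x76 for 608)
]

# The flat buffers returned by TensorRT are then reshaped with these shapes:
# outputs = [out.reshape(shape) for out, shape in zip(trt_outputs, output_shapes)]
```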
Oh, that’s good. I stream frames via an Intel RealSense, and with YOLOv3 and OpenCV I reached around 340 ms. I enabled FP16, set the width/height to 416, and enabled jetson clocks. I am not sure how to do your point 3.
Right now I can think of two ways to improve my result: running the inference through TensorRT instead of OpenCV’s readNet (a rough sketch is below), and removing the part where I display the image with the bounding boxes and draw them in post-processing. That last part costs around 20 ms.
And yes, the input to the network would be a batch of 5 images processed at once.
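In case it helps with the TensorRT route, this is a minimal single-image inference sketch using pycuda (based on the usual TensorRT 6 Python pattern; the function names and the assumption that binding 0 is the input are mine, not the sample's exact code):

```python
import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(engine_path):
    with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def infer(engine, image):
    """Run one preprocessed CHW image through the engine and return the raw outputs."""
    with engine.create_execution_context() as context:
        stream = cuda.Stream()
        host_bufs, dev_bufs, bindings = [], [], []
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            h = cuda.pagelocked_empty(size, dtype)
            d = cuda.mem_alloc(h.nbytes)
            host_bufs.append(h)
            dev_bufs.append(d)
            bindings.append(int(d))
        # Binding 0 is assumed to be the input here.
        np.copyto(host_bufs[0], image.ravel())
        cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
        context.execute_async(batch_size=1, bindings=bindings, stream_handle=stream.handle)
        for h, d, name in zip(host_bufs, dev_bufs, engine):
            if not engine.binding_is_input(name):
                cuda.memcpy_dtoh_async(h, d, stream)
        stream.synchronize()
        return [h for h, name in zip(host_bufs, engine) if not engine.binding_is_input(name)]
```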
Hi Felipe,
About step 3: in the supplied sample there are output shapes that depend on the input size; I changed them too.
By FP16 I meant 16-bit floating point, as it is called in the TensorRT SDK.
I guess your performance will improve a lot if you convert the network to TensorRT.
Are you working on a Jetson Nano?
Yes, sorry, 16-bit floating point. I don't expect the performance to improve a lot, just to get slightly better. Actually I am trying to do it in C++ and it's kind of tricky.
Hopefully I will test it soon on my Jetson Nano.
Hi,
I have tried to run your example with a batch size of 2, but the runtime is twice as long as with one image. Is this expected? Is there no optimization for a bigger batch size?
Please note that batch size = 2 means inferencing two images per TensorRT call.
In general, it should take roughly 2x the execution time since the workload is doubled.
There is usually some gain from a larger batch size, but it is layer-dependent.
Please check first whether you set the source and streammux components to the same batch value.
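If you want to verify that both images are really submitted in one call rather than looped, the two places that matter in the Python sample are roughly these (a fragment; `host_bufs`, `dev_bufs`, `bindings` and `stream` are the illustrative buffer names from the inference sketch earlier in the thread):

```python
# 1. When building the engine, allow the larger batch so TensorRT can optimize for it:
builder.max_batch_size = 2

# 2. When running, copy both preprocessed images into one input buffer
#    (sized for max_batch_size) and pass batch_size explicitly:
batch = np.stack([image_a, image_b])  # shape (2, 3, H, W)
np.copyto(host_bufs[0][:batch.size], batch.ravel())
cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
context.execute_async(batch_size=2, bindings=bindings, stream_handle=stream.handle)
```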