YOLOv3 on Jetson Nano

Hi,
I have a Jetson Nano with JetPack 4.3.
I followed your YOLOv3 example and installed the onnx2trt package for TensorRT 6.
I compiled everything and ran YOLO, and it seems to work at ~194 milliseconds per inference (inference only, not including pre- or post-processing).
The issues are:

  1. When I change the onnx_to_tensorrt.py script to create the engine with FP16 and run YOLO, I get 400 milliseconds per inference.
  2. When I change yolov3.cfg to use an input smaller than 608, for example 416, I get 604 milliseconds per inference.

Is there something I am missing? Was this example optimized for a specific input size and FP32?
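For readers following along: in the TensorRT 6 Python API, half precision is requested on the builder object. The following is only a minimal sketch, assuming the engine is built through a TensorRT 6 style builder as in the sample. `configure_builder` is a hypothetical helper name, and the workspace size is an arbitrary illustration value.

```python
def configure_builder(builder, use_fp16=True, max_batch_size=1,
                      workspace_bytes=1 << 28):
    """Apply size and precision settings to a TensorRT 6 style builder.

    `builder` is expected to expose the TensorRT 6 builder attributes
    max_batch_size, max_workspace_size, platform_has_fast_fp16 and fp16_mode.
    Returns True if FP16 was actually enabled.
    """
    builder.max_batch_size = max_batch_size
    builder.max_workspace_size = workspace_bytes  # assumed value, tune per device
    # Only request FP16 when the GPU reports fast half-precision support.
    fp16_enabled = bool(use_fp16 and builder.platform_has_fast_fp16)
    builder.fp16_mode = fp16_enabled
    return fp16_enabled
```

On a device this would be called on the real builder before the engine is built. Later TensorRT releases moved these settings to a separate builder config object, so the attribute names above apply to the TensorRT 6 generation only.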

thank you

I also wonder what might be going wrong. Which precision is the network running at, INT8 or FP32?

Hi,

1. INT8 is not supported on the Jetson Nano.

You will need a device with GPU architecture 7.x or later (e.g. Xavier) to get INT8 support.

2. Please remember to maximize the device performance first.

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

3. Please remember to recreate the TensorRT engine file whenever you apply a change.

For example, check that TensorRT actually recompiles the engine after you change the network input to 416.
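One way to make sure a cached engine is not silently reused is to compare file timestamps before running. This is only a sketch, assuming the script loads a serialized engine from disk when the file already exists. The helper names and the file names in the usage note are illustrative, not part of the sample.

```python
import os


def engine_is_stale(engine_path, source_paths):
    """Return True if the engine file is missing or older than any source file."""
    if not os.path.exists(engine_path):
        return True
    engine_mtime = os.path.getmtime(engine_path)
    return any(
        os.path.exists(path) and os.path.getmtime(path) > engine_mtime
        for path in source_paths
    )


def remove_stale_engine(engine_path, source_paths):
    """Delete an out-of-date engine file so the next run has to rebuild it.

    Returns True if a file was removed.
    """
    if os.path.exists(engine_path) and engine_is_stale(engine_path, source_paths):
        os.remove(engine_path)
        return True
    return False
```

For example, calling `remove_stale_engine("yolov3.trt", ["yolov3.onnx", "yolov3.cfg"])` before the script starts would force a rebuild after the cfg or ONNX file changes (the file names are assumptions). Timestamps do not catch changes made inside the script itself, such as switching the precision, so deleting the engine by hand is still the safest option in that case.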

Thanks.

Hi,
Thank you for the reply. Do you have an expected FPS or inference time?

Hi,

We can reach 20 fps with YOLOv3 and a 416 input with DeepStream.

Please note that the inference interval is 5, which means inference is run once every 5 frames.

Thanks.

Hi,
Do you mean running the network with a batch of 5?
20 fps sounds good, but I haven’t managed to get inference below 250 milliseconds with 416 and FP16.
Can you explain what I should change in your example?
So far I have:
1. enabled FP16 when creating the engine
2. changed the yolov3.cfg “width” and “height” from 608 to 416
3. changed the output shapes to match the new input.
What else do I need to do to reach this fps?
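For step 3, the three YOLOv3 output shapes follow from the input size: the detection heads run at strides 32, 16 and 8, and each grid cell predicts 3 boxes with 5 + 80 values for the COCO classes. A small sketch of that arithmetic, assuming the standard 80-class YOLOv3 layout (the function name is hypothetical):

```python
def yolov3_output_shapes(width, height, num_classes=80,
                         anchors_per_scale=3, batch_size=1):
    """Compute the NCHW shapes of the three YOLOv3 output tensors."""
    if width % 32 != 0 or height % 32 != 0:
        raise ValueError("YOLOv3 input width and height must be multiples of 32")
    # Each anchor predicts 4 box coordinates, 1 objectness score and the class scores.
    channels = anchors_per_scale * (num_classes + 5)
    return [
        (batch_size, channels, height // stride, width // stride)
        for stride in (32, 16, 8)
    ]


print(yolov3_output_shapes(608, 608))
# [(1, 255, 19, 19), (1, 255, 38, 38), (1, 255, 76, 76)]
print(yolov3_output_shapes(416, 416))
# [(1, 255, 13, 13), (1, 255, 26, 26), (1, 255, 52, 52)]
```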

By the way, the Jetson Nano doesn’t support INT8.

thank you

Oh, that’s good. I stream frames from an Intel RealSense, and with YOLOv3 and OpenCV I reached around 340 ms. I enabled FP16, set the width/height to 416 and enabled jetson_clocks. I am not sure how to do your point 3.

Right now I can think of two ways to improve my result: running the inference through TensorRT instead of OpenCV’s readNet, and removing the post-processing step that draws the bounding boxes and displays the image. That last part costs around 20 ms.

And yes, the input of the network would be a batch of 5 images processed at once.

Thank you

Hi Felipe,
About step 3: the sample defines output shapes that have to match the network, and I changed those too.
By FP16 I meant 16-bit floating point, as the TensorRT SDK calls it.
I guess your performance will improve a lot if you convert the network to TensorRT.
Are you working on a Jetson Nano?

Hi,

DeepStream is an end-to-end pipeline, so the FPS includes the time for camera input, pre-processing and display.

The interval is different from the batch size.
DeepStream uses a tracker to predict the bounding boxes for the frames on which inference is not run.

The workflow should look like this:

detect -> tracking -> tracking -> tracking -> tracking -> detect -> tracking -> …

So a 250 ms inference time is similar to our result.
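The arithmetic behind this can be sketched as follows. It is a simplified model, not the DeepStream API: the function names are hypothetical, the tracker cost is assumed to be zero by default, and camera, pre-processing and display time are ignored.

```python
def schedule(num_frames, period):
    """Label each frame: the detector runs once per `period` frames, the tracker otherwise."""
    return ["detect" if i % period == 0 else "tracking" for i in range(num_frames)]


def amortized_fps(inference_ms, period, tracking_ms=0.0):
    """Frame rate when one inference is spread over `period` frames.

    tracking_ms is the assumed per-frame tracker cost on non-detect frames.
    """
    per_frame_ms = (inference_ms + (period - 1) * tracking_ms) / period
    return 1000.0 / per_frame_ms


print(" -> ".join(schedule(6, 5)))
# detect -> tracking -> tracking -> tracking -> tracking -> detect
print(amortized_fps(250, 5))
# 20.0
```

Under these assumptions, one 250 ms inference every 5 frames averages out to 50 ms per frame, which corresponds to the 20 fps figure.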

Thanks.

Yes, sorry, 16-bit floating point. I don’t expect the performance to improve a lot, just to get slightly better. Actually, I am trying to do it in C++ and it’s kind of tricky.
Hopefully I will test it soon on my Jetson Nano.

Hi,
I tried running your example with a batch size of 2, but the run time is twice that of a single image. Is this expected? Is there no optimization for larger batch sizes?

Hi,

Please note that batchsize=2 means TensorRT runs inference on two images per call.
In general, that should take about 2x the execution time, since there is twice the work.

There is usually some gain from a larger batch size, but it is layer-dependent.
Please first check that the source and streammux components are set to the same batch value.
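To judge whether batching helps, it is easier to compare the time per image than the time per call. A minimal sketch of that comparison; the function names are hypothetical and the timings in the example are illustrative values, not measurements:

```python
def per_image_ms(batch_ms, batch_size):
    """Average time spent on each image in one batched inference call."""
    return batch_ms / batch_size


def batching_gain(single_ms, batch_ms, batch_size):
    """Per-image speed-up of a batched call relative to batch size 1.

    1.0 means no gain; values above 1.0 mean batching helps.
    """
    return single_ms / per_image_ms(batch_ms, batch_size)


# Illustrative numbers: a batch of 2 that takes exactly twice as long gives no gain.
print(batching_gain(single_ms=250, batch_ms=500, batch_size=2))
# 1.0
# A batch of 2 that takes 400 ms would be a 1.25x per-image speed-up.
print(batching_gain(single_ms=250, batch_ms=400, batch_size=2))
# 1.25
```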

Thanks.