Hi,
I have a Jetson Nano with JetPack 4.3.
I followed your YOLOv3 example and installed the onnx2trt package for TensorRT 6.
I compiled everything and ran YOLO, and it seems to work at ~194 milliseconds per inference (inference only, not including pre- or post-processing).
The issues are:
1. When I change the onnx_to_tensorrt.py script to create the engine with FP16 (see the sketch at the end of this post), I get 400 milliseconds per inference.
2. When I change yolov3.cfg to an input smaller than 608, for example 416, I get 604 milliseconds per inference.
Is there something I am missing? Was this example optimized for a specific input size and FP32?
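For reference, here is roughly what my FP16 change looks like (a minimal sketch against the TensorRT 6 Python API; the helper name and workspace size are just illustrative, not the exact sample code):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_fp16_engine(onnx_file_path):
    """Parse an ONNX model and build a TensorRT engine with FP16 enabled."""
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network() as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = 1 << 28  # 256 MiB; adjust for what the Nano can spare
        builder.max_batch_size = 1
        # FP16 only helps if the GPU has fast FP16 support (the Nano does).
        if builder.platform_has_fast_fp16:
            builder.fp16_mode = True
        with open(onnx_file_path, 'rb') as model:
            if not parser.parse(model.read()):
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                return None
        return builder.build_cuda_engine(network)
```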
Hi,
Do you mean running the network with a batch of 5?
20 FPS sounds good, but I haven't managed to get inference below 250 milliseconds with a 416 input and FP16.
Can you explain what I should change in your example?
So far I have done the following:
1. Enabled FP16 when creating the engine.
2. Changed the yolov3.cfg “width” + “height” from 608 to 416.
3. Changed the output shapes to support the new input (see the sketch after this list).
What else do I need to do to reach that FPS?
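For step 3, the point is that YOLOv3's three output grids scale with the input size (input/32, input/16 and input/8), so the shapes used to reshape the flat TensorRT outputs have to change accordingly. A sketch of what I mean (the `output_shapes` name follows the sample; the rest is illustrative):

```python
# YOLOv3 output grids are input/32, input/16 and input/8,
# with 255 = 3 anchors * (80 classes + 5) channels per grid cell.
input_size = 416  # was 608 in the original cfg

output_shapes = [
    (1, 255, input_size // 32, input_size // 32),  # 13x13 for 416 (19x19 for 608)
    (1, 255, input_size // 16, input_size // 16),  # 26x26 for 416 (38x38 for 608)
    (1, 255, input_size // 8,  input_size // 8),   # 52x52 for 416 (76x76 for 608)
]

# The flat buffers returned by TensorRT are then reshaped with these shapes:
# outputs = [out.reshape(shape) for out, shape in zip(trt_outputs, output_shapes)]
```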
Oh, that’s good. I stream frames via an Intel RealSense, and with YOLOv3 and OpenCV I reached around 340 ms. I enabled FP16, set the width/height to 416, and enabled jetson clocks. I am not sure how to do your point 3.
Right now I can think of two ways to improve my result: running the inference through TensorRT instead of OpenCV’s readNet (a rough sketch is below), and removing the part where I display the image with the bounding boxes and draw them in post-processing. That last part costs around 20 ms.
And yes, the input to the network would be a batch of 5 images processed at once.
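In case it helps with the TensorRT route, this is a minimal single-image inference sketch using pycuda (based on the usual TensorRT 6 Python pattern; the function names and the assumption that binding 0 is the input are mine, not the sample's exact code):

```python
import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(engine_path):
    with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def infer(engine, image):
    """Run one preprocessed CHW image through the engine and return the raw outputs."""
    with engine.create_execution_context() as context:
        stream = cuda.Stream()
        host_bufs, dev_bufs, bindings = [], [], []
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            h = cuda.pagelocked_empty(size, dtype)
            d = cuda.mem_alloc(h.nbytes)
            host_bufs.append(h)
            dev_bufs.append(d)
            bindings.append(int(d))
        # Binding 0 is assumed to be the input here.
        np.copyto(host_bufs[0], image.ravel())
        cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
        context.execute_async(batch_size=1, bindings=bindings, stream_handle=stream.handle)
        for h, d, name in zip(host_bufs, dev_bufs, engine):
            if not engine.binding_is_input(name):
                cuda.memcpy_dtoh_async(h, d, stream)
        stream.synchronize()
        return [h for h, name in zip(host_bufs, engine) if not engine.binding_is_input(name)]
```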
Hi Felipe,
About step 3: in the supplied sample there are output shapes that depend on the input size; I changed them too.
By FP16 I meant 16-bit floating point, as it is called in the TensorRT SDK.
I guess your performance will improve a lot if you convert the network to TensorRT.
Are you working on a Jetson Nano?
Yes, sorry, 16-bit floating point. I don't expect the performance to improve a lot, just to get slightly better. Actually I am trying to do it in C++ and it's kind of tricky.
Hopefully I will test it soon on my Jetson Nano.
Hi,
I have tried to run your example with a batch size of 2, but the runtime is twice as long as with one image. Is this expected? Is there no optimization for a bigger batch size?
Please note that batch size = 2 means inferencing two images per TensorRT call.
In general, it should take roughly 2x the execution time since the workload is doubled.
There is usually some gain from a larger batch size, but it is layer-dependent.
Please check first whether you set the source and streammux components to the same batch value.
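If you want to verify that both images are really submitted in one call rather than looped, the two places that matter in the Python sample are roughly these (a fragment; `host_bufs`, `dev_bufs`, `bindings` and `stream` are the illustrative buffer names from the inference sketch earlier in the thread):

```python
# 1. When building the engine, allow the larger batch so TensorRT can optimize for it:
builder.max_batch_size = 2

# 2. When running, copy both preprocessed images into one input buffer
#    (sized for max_batch_size) and pass batch_size explicitly:
batch = np.stack([image_a, image_b])  # shape (2, 3, H, W)
np.copyto(host_bufs[0][:batch.size], batch.ravel())
cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
context.execute_async(batch_size=2, bindings=bindings, stream_handle=stream.handle)
```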