TensorRT + YOLOv3 performance issue

Hi, I’m working on object detection models, currently YOLOv3 in particular, and I’d like to get a reasonably well-performing object detection system running on embedded platforms such as the TX2 or Xavier.

To that end, I benchmarked both a TensorFlow version and a TensorRT version of YOLOv3.

  • The TensorFlow model (pb) runs under tensorflow==1.13.1 (NVIDIA official build) with JetPack 4.2
  • The TensorRT engine was generated via the path 'Darknet checkpoint -> ONNX model -> TensorRT engine' and runs under tensorrt==5.0.6.3 with JetPack 4.2

In each case I timed only the network forward pass.
All other steps, such as input preprocessing and the postprocessing that extracts the bounding boxes, are excluded from the timings.
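
For reference, here is a minimal sketch of how the forward pass alone can be timed. It is not the exact script used for the table. `forward` is a hypothetical zero-argument callable that runs one inference on an already preprocessed input, and the warm-up and run counts are arbitrary assumptions. If the framework launches GPU work asynchronously, `forward` has to block until the result is ready, otherwise only the launch time is measured.

```python
import time


def time_forward(forward, n_warmup=5, n_runs=50):
    """Return per-call latencies in milliseconds for `forward()` only.

    `forward` is a hypothetical callable that runs one inference and
    blocks until the output is available. Preprocessing and
    postprocessing must happen outside of it.
    """
    # Warm-up calls are executed but not recorded.
    for _ in range(n_warmup):
        forward()

    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        forward()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms
```

Usage would look like `latencies = time_forward(lambda: run_engine(batch))`, where `run_engine` and `batch` stand in for whatever inference call and input are actually used.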

I tested these setups on both TX2 and Xavier; the results are in the table below.
https://docs.google.com/spreadsheets/d/1IcSnF9a3SdczWmvvHNcuPGXMDanu8q7axFJSiuAQFLo/edit#gid=0&range=B2:E6
The values in the table are in milliseconds; each one is the average of two runs.

So, I have two questions.
First, how should I deal with the counter-intuitive results for TX2 (MAXP_CORE_ALL) + TensorRT and Xavier (MAXN) + TensorRT, the ones shown in red?
Second, how can I get further performance improvements on TX2 (MAXN) + TensorRT?

Any comments would be appreciated.

There was an option I had missed: jetson_clocks.
Running it brought every result into the expected, intuitive range. Please refer to the sheet below.
https://docs.google.com/spreadsheets/d/1IcSnF9a3SdczWmvvHNcuPGXMDanu8q7axFJSiuAQFLo/edit?pli=1#gid=1026068223&range=B2

Still, I wonder why setting the power model (nvpmodel) to MAXN without running jetson_clocks produces such odd results, as shown in the first sheet of the link above.

Hi @sungwonida, I’m using TensorRT-Yolov3 from lewes6369 (Caffe-based) and was able to get 21 ms on Xavier in MAXN mode (even without jetson_clocks) with FP16 precision at 416 px resolution.

However, I also noticed that jetson_clocks makes the first image faster, but when I run continuously in the same session I get 21 ms even without jetson_clocks.
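
To make that first-image effect visible in a benchmark, one option is to report the first call separately from the steady-state latency. This is only a sketch: the function name `split_first_and_steady` and the `n_skip` parameter are made up for illustration, and the input is assumed to be a list of per-call latencies in milliseconds in the order they were measured.

```python
from statistics import median


def split_first_and_steady(latencies_ms, n_skip=1):
    """Return (first-call latency, median of the remaining calls).

    `latencies_ms` is assumed to hold per-call latencies in call order.
    The first `n_skip` calls are treated as warm-up and left out of the
    steady-state median.
    """
    if len(latencies_ms) <= n_skip:
        raise ValueError("need more measurements than n_skip")
    first = latencies_ms[0]
    steady = latencies_ms[n_skip:]
    return first, median(steady)
```

With only one or two measurements per configuration, the slow first call can dominate the average, so comparing the two numbers side by side shows whether a result reflects warm-up or sustained throughput.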