Detectnet Performance on Jetson TX2

Hello,

I am using a DetectNet model trained with DIGITS, following the tutorial at: https://github.com/NVIDIA/DIGITS/blob/master/examples/object-detection/README.md.

I am training it for just one class, namely cars. I want to attain real-time performance on an embedded device like the Jetson TX2, so I am using jetson-inference, as suggested in other posts.

Currently, with visualization of the bounding boxes disabled, I can reach a maximum of 12 FPS. I am testing the network on a 640x360 video, have enabled maximum performance mode (Max-N), and have also run the jetson_clocks.sh script.

I am looking for a higher frame rate and would also like to do tracking. I am detecting about 10-15 cars per frame, depending on traffic.

Are there any suggestions to improve the frame rate, or will I have to move to a higher-end GPU? Or should I use a framework like YOLO?

Edit:

  • By disabling visualization, I mean not drawing the bounding boxes. It still creates the GL display and shows the video.
  • Note that in the detectnet inference code, I have replaced the rectangle-merging method with non-max suppression rather than the overly simplified check for any overlap. In my case the objects are very close to each other, hence the need for a better rectangle-merging algorithm. But, even with the simple
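For context, the non-max-suppression idea mentioned above can be sketched roughly as follows. This is a minimal illustrative version in Python, not the actual jetson-inference C++ code; the box format (x1, y1, x2, y2, score) and the threshold value are assumptions:

```python
# Hypothetical per-class non-max suppression sketch (not the actual
# jetson-inference implementation). Boxes are (x1, y1, x2, y2, score).

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    remaining = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining if iou(best, b) < iou_threshold]
    return kept
```

Unlike a simple "any overlap" merge, this only suppresses a box when its IoU with a higher-scoring box exceeds the threshold, so adjacent but distinct cars survive.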

Hi,

jetson-inference has a memory-copy step when passing the CPU buffer to CUDA.
You can check another of our samples for a zero-copy demonstration.

The backend sample in MMAPI can reach around 20 fps for car detection:

./backend 1 ../../data/Video/sample_outdoor_car_1080p_10fps.h264 H264 \
        --trt-deployfile ../../data/Model/GoogleNet_one_class/GoogleNet_modified_oneClass_halfHD.prototxt \
        --trt-modelfile ../../data/Model/GoogleNet_one_class/GoogleNet_modified_oneClass_halfHD.caffemodel \
        --trt-forcefp32 0 --trt-proc-interval 1 -fps 20

Here are some suggestions to enable real-time detection:

1. Apply detection every other frame:
If this strategy is acceptable, you can set trt-proc-interval to 2 and raise the fps to 30.

--trt-proc-interval 2 -fps 30
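The effect of the interval flag can be sketched as a simple frame-skipping loop. This is an illustrative Python sketch, not the backend sample's actual code; the function names and the idea of reusing the cached boxes on skipped frames are assumptions:

```python
# Hypothetical frame-skipping loop mirroring --trt-proc-interval 2:
# run the expensive inference only on every Nth frame and reuse the
# last detections on the frames in between.

def run_pipeline(frames, detect, proc_interval=2):
    """Return one detection list per frame, inferring every `proc_interval` frames."""
    results, last = [], []
    for i, frame in enumerate(frames):
        if i % proc_interval == 0:
            last = detect(frame)   # expensive TensorRT inference
        results.append(last)       # skipped frames reuse the cached boxes
    return results
```

Since inference runs on only half the frames, the displayed frame rate can roughly double, at the cost of detections lagging by up to one frame.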

2. Use a smaller input size:
Currently, the network input size is 960x540.
A smaller input size can reach higher performance, but the miss rate also increases.

3. Use a more lightweight model, e.g., YOLO.

Thanks.

Thanks, AstaLLL.

Your suggestions are quite helpful. I will surely try these.

However, I am currently stuck with a segmentation fault, possibly because of the two versions of TensorRT, as described here:
https://devtalk.nvidia.com/default/topic/1024441/jetson-tx2/tensorrt-3-0-rc-now-available-with-support-for-tensorflow/post/5220161/#5220161

Do you think I should put this in a separate post?

Update:
A response has been provided to the original post. Thanks.