Slow object detection speed on Xavier AGX 32GB

Hello,
I’m trying to run Faster R-CNN with Inception v2 (input image size: 2240x400) with TensorFlow on a Xavier AGX 32GB (JetPack 4.3, TF-1.15-nv20.1), but I only get 3 FPS. How can I increase it?
I tried the default TensorRT optimization from the Nvidia example, but it didn’t speed up the model.
The same model runs at about 13 FPS on my computer with a GTX1070, and I read that the Xavier AGX gives similar performance to a workstation with a GTX1070.

So my main question is: how can I improve detection speed, or is it not possible?

Hi,

Have you maximized the device performance first?

sudo nvpmodel -m 0
sudo jetson_clocks

Also, we have a new JetPack release.
It’s recommended to move to our latest TensorRT 7.1 to get the best performance.

Thanks.

Thanks for the reply,
I maximized device performance before running, so that’s not the cause.

I had trouble flashing JetPack 4.4 to this device (SDK Manager fails to flash it), so I stayed with JetPack 4.3.
How should I update TensorRT, or does it come with JetPack 4.4?

Also, are the DLAs used by default alongside the GPU, or do I need to enable them?

What about memory? As I understand it, it’s shared between the GPU and system RAM. Should I set a fixed limit, and how? Sometimes I see that TensorFlow creates the GPU device with 24GB, sometimes with 20GB.
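
For reference, this is the kind of fixed limit I mean. It is only a sketch using the TF 1.x session options: the 8 GB cap and the 32 GB total are assumed example values, the helper names are hypothetical, and I am assuming the fraction is taken relative to the memory TensorFlow reports for the GPU.

```python
def memory_fraction(limit_gb, total_gb):
    """Return the fraction of total memory that a cap of limit_gb represents."""
    if not 0 < limit_gb <= total_gb:
        raise ValueError("limit_gb must be in (0, total_gb]")
    return limit_gb / float(total_gb)


def make_session_config(fraction, allow_growth=True):
    """Build a TF 1.x session config that caps GPU memory use."""
    # Imported lazily so memory_fraction() works without TensorFlow installed.
    import tensorflow as tf

    gpu_options = tf.compat.v1.GPUOptions(
        # Upper bound on the share of GPU memory this process may allocate.
        per_process_gpu_memory_fraction=fraction,
        # Allocate on demand instead of reserving everything at start-up.
        allow_growth=allow_growth,
    )
    return tf.compat.v1.ConfigProto(gpu_options=gpu_options)


# Usage (assumed values: cap TensorFlow at 8 GB out of 32 GB):
# config = make_session_config(memory_fraction(8, 32))
# sess = tf.compat.v1.Session(config=config)
```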

Maybe there is an Nvidia tutorial on how to maximize the performance of a TF model for best speed with minimal loss of accuracy?

Hi,

To figure out whether the issue comes from the implementation or the hardware, could you try the fasterRCNN sample in our TensorRT directory?

/usr/src/tensorrt/samples/sampleFasterRCNN/

Please let me know if you are already using this.

Thanks.

Hello,
I tried this sample, followed the instructions in the README file, and got this result:

&&&& RUNNING TensorRT.sample_fasterRCNN # ./sample_fasterRCNN
[05/08/2020-09:30:12] [I] Building and running a GPU inference engine for FasterRCNN
[05/08/2020-09:30:14] [I] [TRT]
[05/08/2020-09:30:14] [I] [TRT] --------------- Layers running on DLA:
[05/08/2020-09:30:14] [I] [TRT]
[05/08/2020-09:30:14] [I] [TRT] --------------- Layers running on GPU:
[05/08/2020-09:30:14] [I] [TRT] conv1_1 + relu1_1, conv1_2 + relu1_2, pool1, conv2_1 + relu2_1, conv2_2 + relu2_2, pool2, conv3_1 + relu3_1, conv3_2 + relu3_2, conv3_3 + relu3_3, pool3, conv4_1 + relu4_1, conv4_2 + relu4_2, conv4_3 + relu4_3, pool4, conv5_1 + relu5_1, conv5_2 + relu5_2, conv5_3 + relu5_3, rpn_conv/3x3 + rpn_relu/3x3, rpn_bbox_pred, rpn_cls_score, ReshapeCTo2, rpn_cls_prob, ReshapeCTo18, RPROIFused, fc6 + relu6, fc7 + relu7, bbox_pred, cls_score, cls_prob,
[05/08/2020-09:30:16] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[05/08/2020-09:31:43] [I] [TRT] Detected 2 inputs and 3 output network tensors.
[05/08/2020-09:31:46] [I] Detected car in …/data/faster-rcnn/000456.ppm with confidence 99.0063% (Result stored in car-0.990063.ppm).
[05/08/2020-09:31:46] [I] Detected person in …/data/faster-rcnn/000456.ppm with confidence 97.4725% (Result stored in person-0.974725.ppm).
[05/08/2020-09:31:46] [I] Detected cat in …/data/faster-rcnn/000542.ppm with confidence 99.1191% (Result stored in cat-0.991191.ppm).
[05/08/2020-09:31:46] [I] Detected dog in …/data/faster-rcnn/001150.ppm with confidence 99.9603% (Result stored in dog-0.999603.ppm).
[05/08/2020-09:31:46] [I] Detected dog in …/data/faster-rcnn/001763.ppm with confidence 99.7705% (Result stored in dog-0.997705.ppm).
[05/08/2020-09:31:46] [I] Detected horse in …/data/faster-rcnn/004545.ppm with confidence 99.467% (Result stored in horse-0.994670.ppm).
&&&& PASSED TensorRT.sample_fasterRCNN # ./sample_fasterRCNN

It looks like it succeeded.
But I’m using Python with TensorFlow, not C++ as in this sample.
What does it tell us that this sample succeeded? It doesn’t show FPS here.
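
For comparison, this is roughly how I measure FPS on the Python side. It is a minimal sketch: `measure_fps` and `fake_infer` are hypothetical names, and `fake_infer` only stands in for a single-frame `sess.run(...)` call.

```python
import time


def measure_fps(infer, warmup=5, iterations=50):
    """Return the average frames per second of a zero-argument callable."""
    # Warm-up calls are not timed: the first runs usually include one-time
    # costs such as memory allocation or lazy initialization.
    for _ in range(warmup):
        infer()
    start = time.perf_counter()
    for _ in range(iterations):
        infer()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


if __name__ == "__main__":
    # Hypothetical stand-in for running the detector on one frame.
    def fake_infer():
        time.sleep(0.01)

    print("%.1f FPS" % measure_fps(fake_infer, warmup=2, iterations=20))
```

This times only the inference call, so image loading and drawing the boxes are left out of the number.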

Thanks.

Hi,

Sorry for the late update.

Since TensorFlow isn’t optimized for the Jetson platform, it may show some performance regression.
You can also try running sampleFasterRCNN on your GTX1070.
With a native TensorRT example, we expect you to get similar performance on both devices.
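
If you stay with TensorFlow in Python, it may be worth checking how much of the graph TF-TRT actually converts. Below is a minimal sketch for TF 1.15, not a tuned recipe: the function names, the frozen graph path, the output node names, and the parameter values (FP16, 1 GB workspace, minimum segment size) are all assumptions you would need to adapt to your model.

```python
def count_trt_engines(graph_def):
    """Count the nodes that TF-TRT replaced with TensorRT engines."""
    return sum(1 for node in graph_def.node if node.op == "TRTEngineOp")


def convert_with_tftrt(frozen_graph_path, output_names, precision="FP16"):
    """Convert a frozen TF 1.x graph with TF-TRT and return the new GraphDef."""
    # Imported lazily so count_trt_engines() works without TensorFlow installed.
    import tensorflow as tf
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(frozen_graph_path, "rb") as f:
        graph_def.ParseFromString(f.read())

    converter = trt.TrtGraphConverter(
        input_graph_def=graph_def,
        nodes_blacklist=output_names,      # output nodes are kept in TensorFlow
        precision_mode=precision,          # assumed: FP16
        max_batch_size=1,
        max_workspace_size_bytes=1 << 30,  # assumed: 1 GB workspace
        minimum_segment_size=50,           # assumed: skip very small segments
        is_dynamic_op=True,                # build engines at run time
    )
    return converter.convert()


# Usage (hypothetical path and output names):
# trt_graph = convert_with_tftrt("frozen_inference_graph.pb",
#                                ["detection_boxes", "detection_scores"])
# print(count_trt_engines(trt_graph), "TRTEngineOp nodes")
```

If the count is small, most of the model is still executed by plain TensorFlow, and the conversion will not change the speed much.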

Thanks.