How to improve py-faster-caffe performance on JTX1?

I followed AastaLLL's instructions in the 'Caffe failed with py-faster-rcnn demo.py on TX1' post, and was able to build and run the py-faster-rcnn demo script on Jetson TX1.

https://devtalk.nvidia.com/default/topic/974063/jetson-tx1/caffe-failed-with-py-faster-rcnn-demo-py-on-tx1/post/5010194/#5010194

However, compared with a GeForce GPU card, the inference performance on JTX1 is lackluster. More specifically, it takes roughly 1.8s to process each image in the py-faster-rcnn demo on JTX1. In contrast, it takes only ~0.09s on my x64 PC with a GTX-1080 graphics card. I have tried to force JTX1 to always run at maximum clock speeds by running the ~/jetson_clocks.sh script, but that doesn't help much. For other DNN/CNN tasks, I typically see less than a 10X performance difference between JTX1 and the GTX-1080 PC. But in this py-faster-rcnn case, JTX1 falls far behind.

Are there any suggestions about how to improve py-faster-caffe inference performance on JTX1? Thanks.

Screenshot of JTX1 case: JTX1.jpg

Screenshot of GTX-1080 case: GTX1080.jpg

The Jetson eMMC disk I/O is much slower than a typical desktop SSD.
Maybe the problem is at least partly in the load-and-setup part of the code, rather than the inference part of the code?
If I were you, I’d instrument the code to time itself, once the model is fully loaded and ready to run.
The reason for this is that, in a finished embedded system, you’d load the model just once, but you’d repeatedly run the inference as a service to the rest of the system.
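The suggestion above can be sketched as a small timing harness. This is a minimal illustration, not py-faster-rcnn's actual code: `run_inference` is a hypothetical stand-in for the real forward pass (e.g. Caffe's `net.forward()`), and the warm-up count is an assumption:

```python
import time

def run_inference(image):
    # Hypothetical placeholder for the real forward pass
    # (e.g. net.forward() in Caffe); here it just does dummy work.
    return sum(image)

def benchmark(images, warmup=2):
    # Warm-up runs: the first few calls often pay one-time
    # CUDA/cuDNN initialization costs and should not be timed.
    for im in images[:warmup]:
        run_inference(im)
    # Timed runs: measure only steady-state inference.
    timings = []
    for im in images:
        start = time.time()
        run_inference(im)
        timings.append(time.time() - start)
    return sum(timings) / len(timings)

avg_seconds = benchmark([[1, 2, 3]] * 5)
```

Averaging over several images after warm-up separates steady-state inference cost from one-time model-loading and setup cost.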

The py-faster-rcnn demo.py script does load the model only once, and then runs inference on 5 test images consecutively. In fact, it runs inference on 2 dummy images before those 5. On the other hand, I doubt disk I/O is the problem in this case, since the demo script only loads the 5 test jpg files, each roughly 100kB in size.

Anyway, thanks for the suggestion. I might really need to profile the code then…

Hi,

Thanks for your question.

We will check this issue and update you later.

Where does the actual model come from then? I would assume it’s many megabytes of data, being loaded from disk, in addition to the test images.

The inference tasks (on the 5 test images) are timed after the model has been loaded and the 2 dummy images have been processed. So model loading time should not factor in.

Hi,

We have evaluated the performance of the official VGG-16 model.
(Since roi-pooling is not supported by TensorRT, we use the official VGG model instead.)

TensorRT: 97.64ms
Caffe (faster-RCNN branch): 400.62ms

As a result, please use TensorRT for better performance.
If you are interested in detection problems, we recommend DetectNet.

Samples and an introduction can be found here:
jetson-inference: https://github.com/dusty-nv/jetson-inference
DetectNet: https://devblogs.nvidia.com/parallelforall/detectnet-deep-neural-network-object-detection-digits/