I followed AastaLLL’s instructions in the ‘Caffe failed with py-faster-rcnn demo.py on TX1’ post and was able to build and run the py-faster-rcnn demo script on Jetson TX1.
However, compared to a GeForce GPU card, the inference performance on JTX1 is lackluster. More specifically, it takes roughly 1.8s to process each image in the py-faster-rcnn demo on JTX1, while the same takes only ~0.09s on my x64 PC with a GTX-1080 graphics card. I have tried forcing JTX1 to always run at maximum clock speeds by running the ~/jetson_clocks.sh script, but that doesn’t help much. For other DNN/CNN tasks, I typically see less than a 10X performance difference between JTX1 and the GTX-1080 PC, but in this py-faster-rcnn case JTX1 really falls behind.
Are there any suggestions for improving py-faster-rcnn inference performance on JTX1? Thanks.
The Jetson eMMC disk I/O is much slower than a typical desktop SSD.
Maybe the problem is at least partly in the load-and-setup part of the code, rather than the inference part of the code?
If I were you, I’d instrument the code to time itself, once the model is fully loaded and ready to run.
The reason for this is that, in a finished embedded system, you’d load the model just once, but you’d repeatedly run the inference as a service to the rest of the system.
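A minimal sketch of what I mean, with a placeholder standing in for whatever call actually runs the network in the demo (only the timing pattern matters here):

```python
import time

def run_inference(net, im):
    # Placeholder for the demo's per-image detection call
    # (e.g. a Caffe forward pass); substitute the real function here.
    pass

def time_inference(net, images, warmup=2):
    """Return mean per-image inference time, excluding model loading."""
    # A couple of warm-up passes so one-time setup (memory allocation,
    # autotuning, etc.) is not charged to the measured images.
    for im in images[:warmup]:
        run_inference(net, im)
    elapsed = []
    for im in images:
        start = time.time()
        run_inference(net, im)
        elapsed.append(time.time() - start)
    return sum(elapsed) / len(elapsed)
```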
The py-faster-rcnn demo.py script does load the model only once, and then runs inference on 5 test images consecutively. In fact, it runs inference on 2 dummy images before those 5. On the other hand, I doubt disk I/O is the problem in this case, since the demo script only loads the 5 JPG files used for testing, each roughly 100 kB in size.
Anyway, thanks for the suggestion. I might really need to do profiling on the code then…
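If I do, I’ll probably start with the standard-library profiler, something along these lines (the output filename and script path below are just examples):

```python
# Run the demo under cProfile first, e.g.:
#   python -m cProfile -o demo.prof tools/demo.py
# Then inspect where the time actually goes:
import pstats

stats = pstats.Stats('demo.prof')
stats.sort_stats('cumulative').print_stats(20)  # top 20 calls by cumulative time
```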
The inference tasks (on the 5 test images) are timed after the model has been loaded and the 2 dummy images have been processed, so model loading time should not factor in.