Options for optimising a custom TF-TRT tiny YOLOv4 implementation to improve live inference speed on a Jetson Nano

I’ve created a custom tiny YOLOv4 TensorFlow-TensorRT (TF-TRT) model, which I’m running for live inference on a 4 GB Jetson Nano development board using this Python repository.

My input size is 416×416 (needed because I’m trying to detect relatively small objects within the frame), and I’m using 8-bit integer (INT8) precision. The rest of my setup details can be seen in this jtop output:
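For context, the preprocessing in my inference loop amounts to something like the following (a minimal pure-NumPy sketch using nearest-neighbour resizing; the frame source, variable names and helper function are illustrative, not my actual code):

```python
import numpy as np

INPUT_SIZE = 416  # model input resolution; smaller inputs trade accuracy for speed


def preprocess(frame: np.ndarray) -> np.ndarray:
    """Resize an HxWx3 uint8 frame to 416x416 with nearest-neighbour sampling."""
    h, w = frame.shape[:2]
    rows = np.arange(INPUT_SIZE) * h // INPUT_SIZE
    cols = np.arange(INPUT_SIZE) * w // INPUT_SIZE
    resized = frame[rows[:, None], cols]  # fancy-indexed nearest neighbour
    return resized[None, ...]             # add batch dimension: 1x416x416x3


frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for a camera frame
batch = preprocess(frame)
print(batch.shape, batch.dtype)  # (1, 416, 416, 3) uint8
```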

I’m only managing to achieve a maximum throughput of 2–2.3 fps, and need to improve this to at least 15 fps.
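To put the gap in concrete terms, reaching the target needs roughly a 6.5–7.5× speed-up, which also fixes the per-frame latency budget:

```python
current_fps = 2.3  # best observed throughput
target_fps = 15.0  # minimum acceptable throughput

required_speedup = target_fps / current_fps  # ~6.5x at best (2.3 fps), ~7.5x at 2.0 fps
frame_budget_ms = 1000.0 / target_fps        # time available per frame at 15 fps

print(f"required speed-up: {required_speedup:.1f}x")  # 6.5x
print(f"per-frame budget:  {frame_budget_ms:.1f} ms")  # 66.7 ms
```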

I was wondering whether there are any further steps I can take (beyond using TF-TRT, tiny YOLO and INT8 precision) to improve the frame rate on the Nano, and what speed improvement I might expect by moving up to one of the more powerful Jetson boards?