Description
When running a Keras RetinaNet model under TensorFlow on a Jetson Xavier NX, I get just under 1 fps.
After optimising the model with TensorRT, I get roughly 0.25 fps (about one frame every 4 s).
I expect higher throughput with TensorRT optimisation than without. Optimising and running on EC2 instances (G4DN and P3) shows slightly higher throughput, but still not fast enough for real-time video use.
Issues like this have previously been directed back at the repo maintainers (thread) or at C++-centric documentation (thread), but debugging these issues seems to be a generic problem, and the short section on Python doesn't explain how to analyse the performance of TRT models invoked through Python.
Environment
TensorRT Version: 7.1.3
GPU Type: NVIDIA Jetson Xavier NX
Nvidia Driver Version: ???
CUDA Version: 10.2.89
CUDNN Version: 8.0.0.180
Operating System + Version: L4T from JetPack 4.4.1
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable): 2.3.1+nv20.11
PyTorch Version (if applicable): n/a
Baremetal or Container (if container which image + tag): Baremetal (Jetson)
Relevant Files
RetinaNet implementation: fizyr/keras-retinanet (Keras implementation of RetinaNet object detection) - https://github.com/fizyr/keras-retinanet
Steps To Reproduce
- Clone the RetinaNet implementation and install dependencies
- Train a model and save it to disk
- Convert the model to TRT FP16 using `tf.experimental.tensorrt.Converter` and save it to disk (a conversion sketch follows this list). Some ops are reported as not supported for conversion, but presumably those should run no slower than in pure TensorFlow
- Build a collection of images
- Load the RetinaNet model as `model` using the RetinaNet helper functions
- For each image, record `time.time()`, predict on the image with `model(np.expand_dims(image, axis=0))`, record the new `time.time()`, and subtract the two to get the elapsed time (1/fps); a timing sketch follows this list
- Load the optimised model as `model` using `tensorflow.python.keras.saving.save.import_model`
- For each image, time `model(np.expand_dims(image, axis=0))` in the same way
Expected result:
- TensorRT model is faster (probably at least 50% faster for inference)
- There are clear, documented tools for debugging TRT performance from Python, so we can understand why the model isn't performing well and whether it is something simple (such as CUDA libraries not being used) or some other cause
Actual result:
- TensorRT model is four times slower
- There is no obvious way to debug TensorRT within Python
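One sanity check that does seem possible from Python is to inspect the converted graph for `TRTEngineOp` nodes: TF-TRT replaces the subgraphs it converts with these ops, so if there are none, or they cover only small segments, the model is still running mostly as plain TensorFlow. A minimal sketch, again with a placeholder path:

```python
from collections import Counter

import tensorflow as tf

saved = tf.saved_model.load("retinanet_trt_fp16")  # placeholder path
graph_def = saved.signatures["serving_default"].graph.as_graph_def()

# Count op types in the main graph and in its library functions.
ops = Counter(node.op for node in graph_def.node)
for func in graph_def.library.function:
    ops.update(node.op for node in func.node_def)

print("TRTEngineOp nodes:", ops.get("TRTEngineOp", 0))
print("ops outside TRT engines:", sum(c for op, c in ops.items() if op != "TRTEngineOp"))
```

If this shows many ops left outside the engines, fragmentation into lots of small alternating TF/TRT segments could explain the slowdown, though I haven't been able to confirm that from Python alone.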