Jetson Nano Out of Memory running TRT Model

Hello!

Context

I’m developing an object detection application on my workstation GPU, in order to later deploy it on a Jetson Nano.

Platform: GTX1080 | Jetson Nano DevKit 4GB
DeepStream: 6.0
Triton: 21.08
TRT: 8.0
CUDA: 11
JetPack: 4.6
Docker: nvcr.io/nvidia/deepstream:6.0-triton

  1. I have an object detection model (SSD MobileNet, 0.3 Mpx image input) optimized from TensorFlow through TF-TRT with FP16 precision. After the optimization, the TRT model runs a bit faster on my workstation GPU (GTX1080): 10 fps > 11.5 fps. (A conversion sketch follows this list.)
    Just to mention, the model runs correctly only if I export it with TF v1. I don’t know why it fails on DeepStream when exported with TF2, but that’s another topic.
    I’ve been following this: tf-trt-user-guide

  2. Now, I want to take the TRT model and run it on a Jetson Nano.
    Previously, the original TF model (no TRT) was able to run on the Nano. It was extremely slow, about 2 seconds per frame, but it ran.

  3. After some tweaks in the config files (DeepStream, Triton), I can start the pipeline, and Triton Server tries to serve the model. Triton is able to find the model, and there are no errors related to shape, dimensions, etc.; those were already fixed.

  4. At some point the debug output stops, CPU and memory usage hit 100%, and the system is unresponsive for about 5 minutes. In the end, the process is killed because of Out of Memory, confirmed from the dmesg output.
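
For reference, the TF2 variant of the conversion from the tf-trt-user-guide looks roughly like this. This is only a minimal sketch; the SavedModel directory names are placeholders, not my actual paths:

```python
# TF-TRT FP16 conversion sketch (TF2 SavedModel workflow, placeholder paths).
from tensorflow.python.compiler.tensorrt import trt_convert as trt

params = trt.TrtConversionParams(
    precision_mode=trt.TrtPrecisionMode.FP16,
    max_workspace_size_bytes=1 << 28,  # keep the TRT workspace small
)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="ssd_mobilenet_saved_model",  # placeholder input dir
    conversion_params=params,
)
converter.convert()
converter.save("ssd_mobilenet_trt_fp16")  # placeholder output dir, served by Triton
```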

Questions

  1. What exactly is Triton Server doing that takes all the system resources? Can I offload any of that work to the workstation?
  2. I already increased the swap to 4 GB, with the same result.
  3. What can I try to make it work? Different TRT export parameters?
  4. Does it make sense to keep trying? Will the model run much faster on the Nano than before?
  5. Is there a way to further optimize the model to enable real-time object detection?

Thanks as always to the NVIDIA forum team for providing information and solutions.

Hi,

1. If you want to deploy the model on Jetson, these resources need to be allocated on the device directly.

2. Swap is CPU memory. It won’t increase the amount of GPU memory.

3. Could you measure the total memory usage on the desktop first?
Since the Nano has only 4 GiB of memory, it has some limitations with complicated models.

4. Please check below for the Jetson inference benchmarks.
Usually, it’s recommended to use pure TensorRT for lower memory usage and better performance (see the sketch after this list).

5. You can reproduce the above performance with the source code in the GitHub repository below:
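
As a rough illustration of the pure TensorRT path mentioned in point 4, below is a minimal sketch that builds a standalone FP16 engine with the TensorRT 8.0 Python API. It assumes the SSD model has already been exported to ONNX; the file names are placeholders:

```python
# Build a standalone TensorRT FP16 engine from an ONNX export (placeholder paths).
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("ssd_mobilenet.onnx", "rb") as f:  # placeholder ONNX file
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 28   # 256 MiB, keep it small on the Nano
config.set_flag(trt.BuilderFlag.FP16)

# Serialize the engine so it can be loaded without TensorFlow in memory.
serialized_engine = builder.build_serialized_network(network, config)
with open("ssd_mobilenet_fp16.engine", "wb") as f:
    f.write(serialized_engine)
```

The serialized engine can then be used by the Triton TensorRT backend or the DeepStream nvinfer plugin, so TensorFlow never needs to be loaded on the Nano.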

Thanks.

Hi @AastaLLL !
Thanks for the thorough response. I will try all the steps and mark the topic as solved soon.

However, I have one question left.
Why was Triton Server able to load and run the TensorFlow model before the optimization, but after the optimization it cannot be loaded because of Out of Memory?
Is Triton Server performing a hardware-specific optimization step?

In addition, the TensorFlow graph definition file is larger after the optimization. Is that because it stores additional TRT weights? Does that cause extra memory use?

PS:
Correct me if I’m wrong: the Jetson Nano has unified memory, so if swap frees up CPU memory, is there more room for the GPU?

Thanks!!!

Hi,

If you have applied the TF-TRT optimization with Triton, then yes, it does some hardware-specific optimization.
Please note that if TF-TRT is used, Triton needs to load both the TensorFlow and TensorRT libraries.

Usually, the file size increases since the file needs to store the TensorFlow content as well as TensorRT’s.
There is no obvious relation between memory usage and file size.
It’s more related to the libraries you use and the model depth.
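
If you want to double-check what the converted file contains, a minimal sketch is below; it counts the TRTEngineOp nodes that TF-TRT embeds in the SavedModel (the directory name and signature key are placeholders/assumptions):

```python
# Count TRTEngineOp nodes in a TF-TRT converted SavedModel (placeholder path).
import tensorflow as tf

model = tf.saved_model.load("ssd_mobilenet_trt_fp16")
graph_def = model.signatures["serving_default"].graph.as_graph_def()

nodes = list(graph_def.node)
for func in graph_def.library.function:  # TF2 keeps most ops inside functions
    nodes.extend(func.node_def)

trt_nodes = [n for n in nodes if n.op == "TRTEngineOp"]
print(f"TRTEngineOp nodes found: {len(trt_nodes)}")
```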

For the swap memory, yes, that’s correct.
But the system should prefer physical memory first.
It won’t be easy to control whether an allocation comes from swap or physical memory.

Thanks.


@AastaLLL Thank you so much, that is the answer I’ve been looking for.
Thanks!
