Jetson Orin Nano: running two TensorRT models in parallel in real time

Hi,

I’ve bought a Jetson Orin Nano (2025, NVIDIA Tegra Orin GPU, 7.4 GB RAM, 6 cores). My project is to run inference with two different TensorRT models in parallel. I’ve tried, but I get memory allocation errors. Each model occupies about 40% of my GPU when running separately. Moreover, if I run them in two different terminals they run in parallel with only about a 5% performance loss, which is great for me. Is it possible to do this in a single pipeline? Can you suggest a script for parallelizing the models without generating errors? This has to run in real time, so I need fine-grained control over device performance, threading, and so on. I am new to this kind of platform, so I am probably missing something.

Thanks,

Simone

*** Please note that this reply is generated by LLM automatically ***

To address your question, I’ll provide a comprehensive answer based on the relevant information from the context.

Firstly, it’s essential to understand the Jetson Orin Nano’s specifications and capabilities. The Jetson Orin Nano is part of the NVIDIA Jetson Orin family, which includes modules with different memory configurations. In your case, you have a Jetson Orin Nano with 7.4 GB of RAM, 6 CPU cores, and an NVIDIA Tegra Orin GPU.

To run inference in parallel with two different TensorRT models, you’ll need to optimize your code to utilize the available resources efficiently. Since both models occupy about 40% of your GPU while running separately, you can try to run them in parallel within a single pipeline.

However, as you’ve experienced, running multiple models in parallel can lead to memory allocation errors. To overcome this, you can consider the following strategies (a concurrent-execution sketch follows the list):

  1. Model compression: Reduce your models’ size and memory footprint using techniques like pruning, weight sharing, knowledge distillation, or quantization.
  2. Model parallelism: Split your models into smaller sub-models and run them in parallel using multiple threads or processes. This can help reduce the memory requirements for each model.
  3. Batching: Increase the batch size for your models to reduce the number of iterations and memory allocations.
  4. Memory optimization: Optimize your code to minimize memory allocations and deallocations. Use techniques like memory pooling or caching to reduce memory fragmentation.
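
As a rough sketch of the concurrent-execution idea (not a drop-in solution), the snippet below drives two pre-built TensorRT engines from a single process on separate CUDA streams so their work can overlap on the GPU. It assumes TensorRT 8.x with the bindings-based API, pycuda, engines with static input shapes, and placeholder engine file names (detector.engine, pose.engine):

import pycuda.autoinit  # creates the CUDA context on the Jetson GPU
import pycuda.driver as cuda
import tensorrt as trt

LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path):
    # Deserialize a TensorRT engine built beforehand (e.g. with trtexec)
    with open(path, "rb") as f, trt.Runtime(LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def allocate_io(engine):
    # Allocate page-locked host buffers and device buffers for every binding
    bindings, host_bufs, dev_bufs = [], [], []
    for i in range(engine.num_bindings):
        dtype = trt.nptype(engine.get_binding_dtype(i))
        size = trt.volume(engine.get_binding_shape(i))
        host = cuda.pagelocked_empty(size, dtype)
        dev = cuda.mem_alloc(host.nbytes)
        bindings.append(int(dev))
        host_bufs.append(host)
        dev_bufs.append(dev)
    return bindings, host_bufs, dev_bufs

engine_a = load_engine("detector.engine")   # placeholder file names
engine_b = load_engine("pose.engine")
ctx_a = engine_a.create_execution_context()
ctx_b = engine_b.create_execution_context()
bind_a, host_a, dev_a = allocate_io(engine_a)
bind_b, host_b, dev_b = allocate_io(engine_b)
stream_a, stream_b = cuda.Stream(), cuda.Stream()

# Enqueue both inferences asynchronously; the GPU may overlap them
cuda.memcpy_htod_async(dev_a[0], host_a[0], stream_a)
ctx_a.execute_async_v2(bindings=bind_a, stream_handle=stream_a.handle)
cuda.memcpy_dtoh_async(host_a[-1], dev_a[-1], stream_a)

cuda.memcpy_htod_async(dev_b[0], host_b[0], stream_b)
ctx_b.execute_async_v2(bindings=bind_b, stream_handle=stream_b.handle)
cuda.memcpy_dtoh_async(host_b[-1], dev_b[-1], stream_b)

stream_a.synchronize()
stream_b.synchronize()

Whether the two engines actually execute concurrently depends on how much of the GPU each one needs; building both engines with a reduced workspace size can also help with the memory allocation errors.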

To parallelize your models without generating errors, you can use the following scripts and tools:

  1. NVIDIA TensorRT: TensorRT is a software development kit (SDK) for optimizing and deploying deep learning models on NVIDIA GPUs. It provides tools for model pruning, quantization, and optimization.
  2. NVIDIA Deep Learning SDK: The Deep Learning SDK provides a set of tools and libraries for building and deploying deep learning models on NVIDIA GPUs. It includes libraries like cuDNN, cuBLAS, and TensorRT.
  3. PyTorch: PyTorch is a popular deep learning framework that provides built-in support for parallelizing models using multiple GPUs or CPU cores. You can use PyTorch’s DataParallel module to parallelize your models.

Here’s an example PyTorch script that demonstrates how to parallelize two models using multiple GPUs:

import torch
import torch.nn as nn
import torch.optim as optim

# Define your models
class ModelA(nn.Module):
    def __init__(self):
        super(ModelA, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

class ModelB(nn.Module):
    def __init__(self):
        super(ModelB, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Initialize your models and data
model_a = ModelA()
model_b = ModelB()
data = torch.randn(100, 784)

# Define your parallelization strategy
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_a = model_a.to(device)
model_b = model_b.to(device)

# Parallelize your models using multiple GPUs
if torch.cuda.device_count() > 1:
    model_a = nn.DataParallel(model_a)
    model_b = nn.DataParallel(model_b)

# Run your models (moving the input to the same device as the models)
data = data.to(device)
output_a = model_a(data)
output_b = model_b(data)

This script demonstrates the basic structure. Note that nn.DataParallel only has an effect when more than one GPU is visible, so on the single-GPU Orin Nano both models still share the same device. You can modify the script to suit your specific requirements and optimize your code for better performance.
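
One minimal alternative sketch (reusing the model_a, model_b, data, and device objects defined above) is to issue the two forward passes from separate Python threads so the GPU work for both models can be enqueued concurrently; whether the kernels actually overlap depends on the models and on how CUDA streams are used:

import threading

results = {}

def infer(name, model, batch):
    # Each thread runs one model's forward pass without gradient tracking
    with torch.no_grad():
        results[name] = model(batch.to(device))

threads = [
    threading.Thread(target=infer, args=("a", model_a, data)),
    threading.Thread(target=infer, args=("b", model_b, data)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

output_a, output_b = results["a"], results["b"]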

In summary, to run inference in parallel with two different TensorRT models on a Jetson Orin Nano, you’ll need to optimize your code to utilize the available resources efficiently. You can use techniques like model compression, model parallelism, batching, and memory optimization to reduce memory allocations and errors. Additionally, you can use tools like NVIDIA TensorRT, the NVIDIA Deep Learning SDK, and PyTorch to parallelize your models and optimize their performance.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

Hi,

Would you mind sharing more information about your use case?

Do you use the same image for two models?
In such a case, it’s recommended to use the same decoder/pre-processing to save resources.

What’s the expected output? Is it a labeled image that depends on the two outputs?

Thanks.

Hi,

I have to grab an image from a streaming input, resize it, apply a detection model, resize the image again according to the detection output, and then extract the keypoints via a pose estimation model. The input images of the two models are different, and so are the outputs. Do you have any suggestions?
Recap:
Resize image → detection → resize according to detection → pose estimation → keypoints.

Thanks in advance

Hi,

Based on your use case, the models cannot run in parallel on the same image due to the dependency between them.
But you should be able to do this across different frames: e.g., detection on frame #2 and pose estimation on frame #1 can run in parallel.
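
As an illustration of that frame-pipelining idea (a rough sketch only; detect, crop, estimate_pose, and publish are placeholders for your own TensorRT inference and output code), you can run each stage in its own thread with a queue in between, so detection for frame N+1 overlaps with pose estimation for frame N:

import queue
import threading

frames = queue.Queue(maxsize=4)       # resized frames waiting for detection
detections = queue.Queue(maxsize=4)   # detection crops waiting for pose estimation

def detection_worker():
    while True:
        frame = frames.get()
        if frame is None:                      # sentinel from the capture loop
            detections.put(None)
            break
        boxes = detect(frame)                  # detection model (placeholder)
        detections.put(crop(frame, boxes))     # resize according to detection

def pose_worker():
    while True:
        crops = detections.get()
        if crops is None:
            break
        keypoints = [estimate_pose(c) for c in crops]  # pose model (placeholder)
        publish(keypoints)                             # output step (placeholder)

threading.Thread(target=detection_worker, daemon=True).start()
threading.Thread(target=pose_worker, daemon=True).start()
# The capture loop then pushes each resized frame with frames.put(frame)
# and finally frames.put(None) to shut the pipeline down.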

Please check our DeepStream SDK:

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.