Jetson Orin Nano: running two TensorRT models in parallel in real time

Hi,

I’ve bought a Jetson Orin Nano (2025, NVIDIA Tegra Orin GPU, 7.4 GB RAM, 6 cores). My project is to run inference with two different TensorRT models in parallel. I’ve tried, but I get memory allocation errors. Each model occupies about 40% of my GPU when running separately. Moreover, if I run them in two different terminals they run in parallel with only about a 5% performance loss, which is great for me. Is it possible to do this in a single pipeline? Can you suggest a script for parallelizing the models without generating errors? This has to run in real time, so I need fine-grained control over device performance, threading, and so on. I am new to this kind of platform, so I am probably missing something.

Thanks,

Simone

*** Please note that this reply is generated by LLM automatically ***

To address your question, I’ll provide a comprehensive answer based on the relevant information from the context.

Firstly, it’s essential to understand the Jetson Orin Nano’s specifications and capabilities. The Jetson Orin Nano is part of the NVIDIA Jetson Orin family, which includes modules with different memory configurations. In your case, you have a Jetson Orin Nano with 7.4 GB of RAM, 6 CPU cores, and an NVIDIA Tegra Orin GPU.

To run inference in parallel with two different TensorRT models, you’ll need to optimize your code to utilize the available resources efficiently. Since both models occupy about 40% of your GPU while running separately, you can try to run them in parallel within a single pipeline.

However, as you’ve experienced, running multiple models in parallel can lead to memory allocation errors. To overcome this, you can consider the following strategies (a concurrent-execution sketch follows the list):

  1. Model compression: Reduce your models’ size and memory footprint using techniques like pruning, weight sharing, knowledge distillation, or quantization.
  2. Model parallelism: Split your models into smaller sub-models and run them in parallel using multiple threads or processes. This can help reduce the memory requirements for each model.
  3. Batching: Increase the batch size for your models to reduce the number of iterations and memory allocations.
  4. Memory optimization: Optimize your code to minimize memory allocations and deallocations. Use techniques like memory pooling or caching to reduce memory fragmentation.
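
As a rough sketch of the concurrent-execution idea (not a drop-in solution), the snippet below drives two pre-built TensorRT engines from a single process on separate CUDA streams so their work can overlap on the GPU. It assumes TensorRT 8.x with the bindings-based API, pycuda, engines with static input shapes, and placeholder engine file names (detector.engine, pose.engine):

import pycuda.autoinit  # creates the CUDA context on the Jetson GPU
import pycuda.driver as cuda
import tensorrt as trt

LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(path):
    # Deserialize a TensorRT engine built beforehand (e.g. with trtexec)
    with open(path, "rb") as f, trt.Runtime(LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

def allocate_io(engine):
    # Allocate page-locked host buffers and device buffers for every binding
    bindings, host_bufs, dev_bufs = [], [], []
    for i in range(engine.num_bindings):
        dtype = trt.nptype(engine.get_binding_dtype(i))
        size = trt.volume(engine.get_binding_shape(i))
        host = cuda.pagelocked_empty(size, dtype)
        dev = cuda.mem_alloc(host.nbytes)
        bindings.append(int(dev))
        host_bufs.append(host)
        dev_bufs.append(dev)
    return bindings, host_bufs, dev_bufs

engine_a = load_engine("detector.engine")   # placeholder file names
engine_b = load_engine("pose.engine")
ctx_a = engine_a.create_execution_context()
ctx_b = engine_b.create_execution_context()
bind_a, host_a, dev_a = allocate_io(engine_a)
bind_b, host_b, dev_b = allocate_io(engine_b)
stream_a, stream_b = cuda.Stream(), cuda.Stream()

# Enqueue both inferences asynchronously; the GPU may overlap them
cuda.memcpy_htod_async(dev_a[0], host_a[0], stream_a)
ctx_a.execute_async_v2(bindings=bind_a, stream_handle=stream_a.handle)
cuda.memcpy_dtoh_async(host_a[-1], dev_a[-1], stream_a)

cuda.memcpy_htod_async(dev_b[0], host_b[0], stream_b)
ctx_b.execute_async_v2(bindings=bind_b, stream_handle=stream_b.handle)
cuda.memcpy_dtoh_async(host_b[-1], dev_b[-1], stream_b)

stream_a.synchronize()
stream_b.synchronize()

Whether the two engines actually execute concurrently depends on how much of the GPU each one needs; building both engines with a reduced workspace size can also help with the memory allocation errors.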

To parallelize your models without generating errors, you can use the following scripts and tools:

  1. NVIDIA TensorRT: TensorRT is a software development kit (SDK) for optimizing and deploying deep learning models on NVIDIA GPUs. It provides tools for model pruning, quantization, and optimization.
  2. NVIDIA Deep Learning SDK: The Deep Learning SDK provides a set of tools and libraries for building and deploying deep learning models on NVIDIA GPUs. It includes libraries like cuDNN, cuBLAS, and TensorRT.
  3. PyTorch: PyTorch is a popular deep learning framework that provides built-in support for parallelizing models using multiple GPUs or CPU cores. You can use PyTorch’s DataParallel module to parallelize your models.

Here’s an example PyTorch script that demonstrates how to parallelize two models using multiple GPUs:

import torch
import torch.nn as nn
import torch.optim as optim

# Define your models
class ModelA(nn.Module):
    def __init__(self):
        super(ModelA, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

class ModelB(nn.Module):
    def __init__(self):
        super(ModelB, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Initialize your models and data
model_a = ModelA()
model_b = ModelB()
data = torch.randn(100, 784)

# Define your parallelization strategy
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_a = model_a.to(device)
model_b = model_b.to(device)

# Parallelize your models using multiple GPUs
if torch.cuda.device_count() > 1:
    model_a = nn.DataParallel(model_a)
    model_b = nn.DataParallel(model_b)

# Run your models (moving the input to the same device as the models)
data = data.to(device)
output_a = model_a(data)
output_b = model_b(data)

This script demonstrates the basic structure. Note that nn.DataParallel only has an effect when more than one GPU is visible, so on the single-GPU Orin Nano both models still share the same device. You can modify the script to suit your specific requirements and optimize your code for better performance.
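
One minimal alternative sketch (reusing the model_a, model_b, data, and device objects defined above) is to issue the two forward passes from separate Python threads so the GPU work for both models can be enqueued concurrently; whether the kernels actually overlap depends on the models and on how CUDA streams are used:

import threading

results = {}

def infer(name, model, batch):
    # Each thread runs one model's forward pass without gradient tracking
    with torch.no_grad():
        results[name] = model(batch.to(device))

threads = [
    threading.Thread(target=infer, args=("a", model_a, data)),
    threading.Thread(target=infer, args=("b", model_b, data)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

output_a, output_b = results["a"], results["b"]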

In summary, to run inference in parallel with two different TensorRT models on a Jetson Orin Nano, you’ll need to optimize your code to utilize the available resources efficiently. You can use techniques like model compression, model parallelism, batching, and memory optimization to reduce memory allocations and errors. Additionally, you can use tools like NVIDIA TensorRT, the NVIDIA Deep Learning SDK, and PyTorch to parallelize your models and optimize their performance.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

Hi,

Would you mind sharing more information about your use case?

Do you use the same image for two models?
In such a case, it’s recommended to use the same decoder/pre-processing to save resources.

What’s the expected output? Is it a labeled image that depends on the two outputs?

Thanks.

Hi,

I have to grab an image from a streaming input, resize it, apply a detection model, resize the image again according to the detection output, and then extract the keypoints via a pose estimation model. The input images of the two models are different, and so are the outputs. Do you have any suggestions?
Recap:
Resize image → detection → resize according to detection → pose estimation → keypoints.

Thanks in advance

Hi,

Based on your use case, the models cannot run in parallel on the same image due to the dependency between them.
But you should be able to do this across different frames: e.g., detection on frame #2 and pose estimation on frame #1 can run in parallel.
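
As an illustration of that frame-pipelining idea (a rough sketch only; detect, crop, estimate_pose, and publish are placeholders for your own TensorRT inference and output code), you can run each stage in its own thread with a queue in between, so detection for frame N+1 overlaps with pose estimation for frame N:

import queue
import threading

frames = queue.Queue(maxsize=4)       # resized frames waiting for detection
detections = queue.Queue(maxsize=4)   # detection crops waiting for pose estimation

def detection_worker():
    while True:
        frame = frames.get()
        if frame is None:                      # sentinel from the capture loop
            detections.put(None)
            break
        boxes = detect(frame)                  # detection model (placeholder)
        detections.put(crop(frame, boxes))     # resize according to detection

def pose_worker():
    while True:
        crops = detections.get()
        if crops is None:
            break
        keypoints = [estimate_pose(c) for c in crops]  # pose model (placeholder)
        publish(keypoints)                             # output step (placeholder)

threading.Thread(target=detection_worker, daemon=True).start()
threading.Thread(target=pose_worker, daemon=True).start()
# The capture loop then pushes each resized frame with frames.put(frame)
# and finally frames.put(None) to shut the pipeline down.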

Please check our DeepStream SDK:

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.