Inference chaining using DeepStream and Triton

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) AGX Orin 64 GB
• DeepStream Version 7.1.0
• JetPack Version (valid for Jetson only) 6.2
• TensorRT Version TensorRT v101000
• Triton Inference Server NVIDIA Release 25.05 (build 170551412) - Triton Server Version 2.58.0

Hello,

I have successfully tested a model on an AGX Orin 64 GB with Triton Inference Server and DeepStream.
Now I would like to run the same model on a 2x AGX Orin 32 GB configuration.

For that purpose, I have split the ONNX model in half and converted each half to a TensorRT engine using trtexec.

I would like to know how to configure the DeepStream pipeline when using an ONNX model split in half. As I am using Triton Inference Server, I am using the nvinferserver plugin.

Here are extracts of the two config.pbtxt files (model_part1 and model_part2):

name: "model_part1"
platform: "tensorrt_plan"
max_batch_size: 1
default_model_filename: "model_part1.engine"
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 720, 1280 ]
  }
]
output [
  {
    name: "/conv3_1/Conv_output_0"
    data_type: TYPE_FP32
    dims: [ 64,90,160 ]
  }
]

name: "model_part2"
platform: "tensorrt_plan"
max_batch_size: 1
default_model_filename: "model_part2.engine"
input [
  {
    name: "/conv3_1/Conv_output_0"
    data_type: TYPE_FP32
    dims: [ 64,90,160 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 5, 720, 1280 ]
  }
]
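
As a quick sanity check, both servers can be queried over gRPC to confirm that the output of model_part1 lines up with the input of model_part2. A minimal sketch using the tritonclient Python package; the server addresses are assumed placeholders:

# Hypothetical sanity check (not part of the DeepStream pipeline): the output
# layer of model_part1 should match the input layer of model_part2.
import tritonclient.grpc as grpcclient

part1 = grpcclient.InferenceServerClient(url="localhost:8001")       # Triton on Orin 1 (assumed address)
part2 = grpcclient.InferenceServerClient(url="192.168.121.2:8001")   # Triton on Orin 2 (assumed address)

out_meta = part1.get_model_metadata("model_part1").outputs[0]
in_meta = part2.get_model_metadata("model_part2").inputs[0]

print("part1 output:", out_meta.name, out_meta.datatype, list(out_meta.shape))
print("part2 input :", in_meta.name, in_meta.datatype, list(in_meta.shape))
assert out_meta.name == in_meta.name and list(out_meta.shape) == list(in_meta.shape)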

I have tried to configure the second model as an SGIE with the option process_mode: PROCESS_MODE_CLIP_OBJECTS, but it doesn't seem to work.

Could you help me?

Kind regards

  1. Please refer to /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-test2 for a PGIE+SGIE nvinferserver sample.
  2. If the app still doesn't work, what are the two models used for, respectively? How do you know that the outputs of the first model are correct? How do you do the preprocessing for the second model?

Hello,

  1. Please refer to /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-test2 for a PGIE+SGIE nvinferserver sample.

I tried to take some inspiration from the suggested example, but it is not the same scenario. I don't have a PGIE+SGIE setup; I have only one PGIE, but split into two smaller parts. In other words, I don't have two different models; it is one model that I have split into two halves.

Therefore the output of the first model is the same as the input of the second model (64x90x160), but it doesn't represent anything by itself.

Conceptually, I would like to run inference on the first model, get the output tensor (no postprocessing), and then feed the second model with this tensor (no preprocessing). I don't need anything in between.
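
To make the intent concrete, here is a minimal sketch of that chaining done directly with the Triton gRPC Python client (model and tensor names are taken from the config.pbtxt extracts above; the server addresses and the dummy input frame are assumptions):

# Chaining sketch: the raw FP32 output of model_part1 is fed to model_part2
# untouched. Server addresses and the dummy input frame are assumptions.
import numpy as np
import tritonclient.grpc as grpcclient

c1 = grpcclient.InferenceServerClient(url="localhost:8001")       # Triton on Orin 1 (assumed)
c2 = grpcclient.InferenceServerClient(url="192.168.121.2:8001")   # Triton on Orin 2 (assumed)

frame = np.zeros((1, 3, 720, 1280), dtype=np.float32)  # already-preprocessed frame (dummy)

inp1 = grpcclient.InferInput("input", list(frame.shape), "FP32")
inp1.set_data_from_numpy(frame)
mid = c1.infer("model_part1", [inp1]).as_numpy("/conv3_1/Conv_output_0")  # (1, 64, 90, 160)

inp2 = grpcclient.InferInput("/conv3_1/Conv_output_0", list(mid.shape), "FP32")
inp2.set_data_from_numpy(mid)
out = c2.infer("model_part2", [inp2]).as_numpy("output")  # (1, 5, 720, 1280)

In DeepStream terms, the question is how to reproduce the hand-off of the intermediate tensor between two nvinferserver instances.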

  1. If the app still doesn't work, what are the two models used for, respectively? How do you know that the outputs of the first model are correct? How do you do the preprocessing for the second model?

As I said, I don't care about the output of the first model; I just need to get it and feed it directly to the second model without preprocessing.

The reason for all of this is to create a cluster of Triton servers and be able to run large models by splitting them into smaller ones.

Thank you

Thanks for sharing! Please refer to the sample /opt/nvidia/deepstream/deepstream/sources/TritonBackendEnsemble. There, the SGIE is a Triton ensemble model that combines Secondary_VehicleMake and Secondary_VehicleTypes. You can use your two models as an ensemble model.

The example you are referencing could indeed work if both half-models were on the same Triton instance.

The idea here is for each GPU (AGX Orin) to run its own Triton server instance. Therefore, an ensemble model is not a possibility in my case.

What I want to do is conceptually really simple: take one model, get its output tensor, and then feed another model with this tensor (simple model chaining without altering the data in between).

Does DeepStream have no plugin or configuration for such a simple behaviour?

How did you split the ONNX model? What is the output format? Are there two ONNX models after splitting?

Currently there is no ready-made sample for passing inference results to another device for inference. Here are some ideas.

  1. Run "model_part1" on the first device. In particular, "output-tensor-meta=1" needs to be set for nvinfer; then you can access the inference results from the user meta. Please refer to the sample deepstream-infer-tensor-meta-test. Then send the results to the other device with your custom IPC implementation (see the probe sketch after this list).
  2. After receiving the inference results on the other device, you can continue inference with "model_part2".
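
For illustration, a minimal probe sketch in Python/pyds following the pattern of the deepstream-infer-tensor-meta / ssd-parser samples; the probe name, the fixed 64x90x160 shape, and the dump-to-file step (standing in for the custom IPC) are assumptions:

# Probe sketch: read the raw output tensor of model_part1 from the frame user
# meta and dump it to a file (the file stands in for a custom IPC channel).
import ctypes
import numpy as np
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds

def dump_part1_tensor_probe(pad, info, user_data):
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(info.get_buffer()))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        l_user = frame_meta.frame_user_meta_list
        while l_user is not None:
            user_meta = pyds.NvDsUserMeta.cast(l_user.data)
            if user_meta.base_meta.meta_type == pyds.NvDsMetaType.NVDSINFER_TENSOR_OUTPUT_META:
                tensor_meta = pyds.NvDsInferTensorMeta.cast(user_meta.user_meta_data)
                layer = pyds.get_nvds_LayerInfo(tensor_meta, 0)
                ptr = ctypes.cast(pyds.get_ptr(layer.buffer), ctypes.POINTER(ctypes.c_float))
                tensor = np.ctypeslib.as_array(ptr, shape=(64, 90, 160)).copy()
                np.save("model_part1_output.npy", tensor)  # replace with custom IPC
            l_user = l_user.next
        l_frame = l_frame.next
    return Gst.PadProbeReturn.OK

Attached to the src pad of the first GIE (with output tensor meta enabled), this at least confirms the tensor is available before it is shipped to the other device.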

Yes, there are two ONNX models after splitting, and I then convert them to engine files using trtexec.

In this scenario, I am losing the benefit of using DeepStream pipelines. Since I am using the nvinferserver plugin, I can reach both Triton servers from the same pipeline; the issue is only with the data sent to the second model.

If I understand correctly, there is nothing available as standard in DeepStream to handle such a scenario…

I was hoping not to have to dig into the DeepStream plugin source code, but it seems that I will need to create a custom nvinferserver for the second part of the model, or at least a custom custom_process_funcion.

  1. Could you elaborate on this? How did you run one pipeline on two devices?
  2. Do you mean running "model_part1" on one device and "model_part2" on the other device? If so, how do you pass the inference results between the two devices? As written in my last comment, there is currently no ready-made sample for passing inference results to another device for inference, and custom_process_funcion can't fix this issue. Please refer to /opt/nvidia/deepstream/deepstream/sources/TritonOnnxYolo/nvdsinferserver_custom_impl_yolo/nvdsinferserver_custom_process_yolo.cpp for how to use custom_process_funcion.

The DeepStream pipeline is running on Orin 1 only, but each Orin has its own Triton server instance.
In the nvinferserver configuration file, I can specify the URL of the Triton server over gRPC for each model.
Taking vehicle tracking as an example, I can configure the PGIE and SGIE like this:

## PGIE
infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 1
  backend {
    triton {
      model_name: "trafficcamnet"
      version: -1
      grpc {
        url: "0.0.0.0:8001" ## -> triton server on the ORIN 1
        enable_cuda_buffer_sharing: true
      }
    }
  }
## SGIE
infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 0
  backend {
    triton {
      model_name: "vehiclemakenet"
      version: -1
      grpc {
        url: "192.168.121.2:8001" ## -> triton server on the ORIN 2
        enable_cuda_buffer_sharing: true
      }
    }
  }

In the DeepStream pipeline, I have: … → pgie → sgie → …

And that works fine. Of course, there are some preprocess and postprocess definitions in the config files to make it work.
But it seems that in this scenario the SGIE does not take the output of the PGIE as its input; it seems to use some metadata as its input (the bounding boxes produced by the PGIE's postprocessing).

In my original scenario (split ONNX model), I would like the "SGIE" to take as its input the raw output tensor from the PGIE (because the metadata is meaningless, as we are in the middle of the model).

Using nvinferserver with gRPC covers the communication side. I still need to solve the data handling part.

You can use this pipeline: "…->nvinferserver(1)->nvdspreprocess->nvinferserver(2)->…". There are some things to note (a sketch of this segment follows the list below).

  1. In the config of nvinferserver 1, "output_tensor_meta: true" needs to be set to save the inference results to user meta. Please refer to /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-3d-action-recognition/config_triton_infer_primary_2d_action.txt.
  2. The nvdspreprocess plugin provides a custom library interface for preprocessing input streams. In the custom lib, you can get the inference results from user meta and copy them into new tensors. Please refer to this sample for how to prepare tensors.
  3. In the config of nvinferserver 2, input_tensor_from_meta needs to be set. With this configuration, nvinferserver will use the tensors directly instead of doing its own preprocessing. Please refer to /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-3d-action-recognition/config_triton_infer_primary_2d_action.txt.
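
For illustration, a minimal Python/GStreamer sketch of that segment; the element names and config file paths are placeholders, and streammux/OSD/sink elements are omitted:

# Sketch of the "... -> nvinferserver(1) -> nvdspreprocess -> nvinferserver(2) -> ..."
# segment. Config file paths are placeholders.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.Pipeline.new("split-model-chain")

pgie1 = Gst.ElementFactory.make("nvinferserver", "pgie-part1")
pgie1.set_property("config-file-path", "config_nvinferserver_part1.txt")  # output_tensor_meta: true

bridge = Gst.ElementFactory.make("nvdspreprocess", "tensor-bridge")
bridge.set_property("config-file", "config_preprocess_bridge.txt")        # custom tensor preparation lib

pgie2 = Gst.ElementFactory.make("nvinferserver", "pgie-part2")
pgie2.set_property("config-file-path", "config_nvinferserver_part2.txt")  # input_tensor_from_meta

for elem in (pgie1, bridge, pgie2):
    pipeline.add(elem)
pgie1.link(bridge)
bridge.link(pgie2)
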
  1. In the config of nvinferserver 1, "output_tensor_meta: true" needs to be set to save the inference results to user meta. Please refer to /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-3d-action-recognition/config_triton_infer_primary_2d_action.txt.

OK, I have done that, and it works: I can see the tensor with a probe by checking user_meta.base_meta.meta_type == pyds.NvDsMetaType.NVDSINFER_TENSOR_OUTPUT_META.

  1. The nvdspreprocess plugin provides a custom library interface for preprocessing input streams. In the custom lib, you can get the inference results from user meta and copy them into new tensors. Please refer to this sample for how to prepare tensors.

I think I am blocked here. I modified the CustomTensorPreparation function like this:

NvDsPreProcessStatus
CustomTensorPreparation(CustomCtx *ctx, NvDsPreProcessBatch *batch, NvDsPreProcessCustomBuf *&buf,
                        CustomTensorParams &tensorParam, NvDsPreProcessAcquirer *acquirer)
{
   printf("CustomTensorPreparation called\n");
   
    NvDsPreProcessStatus status = NVDSPREPROCESS_TENSOR_NOT_READY;

    // Acquire a buffer from tensor pool
    buf = acquirer->acquire();
    void *pDst = buf->memory_ptr; // Destination GPU pointer

    GstBuffer *inbuf = (GstBuffer *)batch->inbuf;
    NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta(inbuf);

    if (!batch_meta) {
        g_printerr("Failed to get batch_meta from GstBuffer\n");
        return NVDSPREPROCESS_TENSOR_NOT_READY;
    }

    bool tensor_found = false;

    // Iterate over frames in batch
    for (NvDsMetaList *l_frame = batch_meta->frame_meta_list; l_frame != nullptr; l_frame = l_frame->next) {
        NvDsFrameMeta *frame_meta = (NvDsFrameMeta *)l_frame->data;

        // Iterate through frame user metadata to find tensor output
        for (NvDsMetaList *l_user = frame_meta->frame_user_meta_list; l_user != nullptr; l_user = l_user->next) {
            NvDsUserMeta *user_meta = (NvDsUserMeta *)l_user->data;

            if (user_meta->base_meta.meta_type == NVDSINFER_TENSOR_OUTPUT_META) {
                // Found tensor output from previous nvinferserver
                NvDsInferTensorMeta *tensor_meta = (NvDsInferTensorMeta *)user_meta->user_meta_data;

                g_print("Found tensor meta: %u output layers\n", tensor_meta->num_output_layers);

                for (uint i = 0; i < tensor_meta->num_output_layers; ++i) {
                    void *src_gpu_ptr = tensor_meta->out_buf_ptrs_dev[i];
                    NvDsInferDims dims = tensor_meta->output_layers_info[i].inferDims;
                    size_t num_elements = 1;
                    for (uint d = 0; d < dims.numDims; ++d) {
                        num_elements *= dims.d[d];
                    }
                    size_t layer_size_bytes = num_elements * sizeof(float); // assuming float32

                    g_print("Copying layer %u of size %zu bytes\n", i, layer_size_bytes);

                    // Copy data from previous model output (GPU) to current buffer (GPU)
                    cudaMemcpy(pDst,
                               src_gpu_ptr,      // Source: GPU pointer
                               layer_size_bytes, // Size
                               cudaMemcpyDeviceToDevice);

                    // Advance destination pointer for next layer (if needed)
                    pDst = (char *)pDst + layer_size_bytes;
                }

                tensor_found = true;
                status = NVDSPREPROCESS_SUCCESS;
                break;
            }
        }

        if (tensor_found)
            break;
    }

    if (!tensor_found) {
        g_printerr("No NvDsInferTensorMeta found in frame metadata!\n");
    }

    return status;
}

And I can see this in the logs:

CustomTensorPreparation called
Found tensor meta: 1 output layers
Copying layer 0 of size 3686400 bytes

This seems correct, because the output tensor of model_part1 is 64x90x160 values (x4 bytes for FP32) = 3,686,400 bytes.
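
For reference, the byte count works out as follows:

# Expected size of the FP32 tensor copied above: 64 x 90 x 160 values, 4 bytes each.
assert 64 * 90 * 160 * 4 == 3686400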

Here is the config file of the nvdspreprocess plugin:

[property]
enable=1
    # uniquely identify the metadata generated by this element
unique-id=5
process-on-frame=1
target-unique-ids=1
network-input-order=0
    # gpu-id to be used
gpu-id=0
    # if enabled maintain the aspect ratio while scaling
#maintain-aspect-ratio=1
    # if enabled pad symmetrically with maintain-aspect-ratio enabled
#symmetric-padding=1
    # processing width/height at which the image is scaled
processing-width=160
processing-height=90
    # max buffer in scaling buffer pool
scaling-buf-pool-size=1
    # max buffer in tensor buffer pool
tensor-buf-pool-size=1
    # tensor shape based on network-input-order
network-input-shape= 1;64;90;160
    # 0=RGB, 1=BGR, 2=GRAY
network-color-format=0
    # 0=FP32, 1=UINT8, 2=INT8, 3=UINT32, 4=INT32, 5=FP16
tensor-data-type=0
    # tensor name same as input layer name
tensor-name=/conv3_1/Conv_output_0
    # 0=NVBUF_MEM_DEFAULT 1=NVBUF_MEM_CUDA_PINNED 2=NVBUF_MEM_CUDA_DEVICE 3=NVBUF_MEM_CUDA_UNIFIED
scaling-pool-memory-type=0
    # 0=NvBufSurfTransformCompute_Default 1=NvBufSurfTransformCompute_GPU 2=NvBufSurfTransformCompute_VIC
scaling-pool-compute-hw=0
    # Scaling Interpolation method
    # 0=NvBufSurfTransformInter_Nearest 1=NvBufSurfTransformInter_Bilinear 2=NvBufSurfTransformInter_Algo1
    # 3=NvBufSurfTransformInter_Algo2 4=NvBufSurfTransformInter_Algo3 5=NvBufSurfTransformInter_Algo4
    # 6=NvBufSurfTransformInter_Default
scaling-filter=0
    # custom library .so path having custom functionality
custom-lib-path=/home/orkais/orkais/examples/orkais_ulg_split/nvdspreprocess_lib/libcustom2d_preprocess.so
    # custom tensor preparation function name having predefined input/outputs
    # check the default custom library nvdspreprocess_lib for more info
custom-tensor-preparation-function=CustomTensorPreparation

[user-configs]
   # Below parameters get used when using default custom library nvdspreprocess_lib
   # network scaling factor
pixel-normalization-factor=1
   # mean file path in ppm format
#mean-file=
   # array of offsets for each channel
#offsets=

[group-0]
src-ids=0
custom-input-transformation-function=CustomTransformation
process-on-roi=0
#process-on-all-objects=0
#roi-params-src-0=0;0;100;100
#draw-roi=0
#input-object-min-width=100
#input-object-min-height=100
  1. In the config of nvinferserver 2, input_tensor_from_meta needs to be set. With this configuration, nvinferserver will use the tensors directly instead of doing its own preprocessing. Please refer to /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-3d-action-recognition/config_triton_infer_primary_2d_action.txt.

Yes, I added it:

input_tensor_from_meta {
  is_first_dim_batch: true
}

CONCLUSION
Unfortunately, if I probe nvinferserver 2, I get the output tensor of the first nvinferserver (even though output_tensor_meta: true is set in the config of nvinferserver 2).
Also, on my display I can't see the segmentation mask that should be the output of nvinferserver 2. (If I use the non-split model, I can see the segmentation mask; that pipeline works fine.)

Could you help me diagnose this further?

Thank you

You can dump the inference results generated by model 1 to a file, then test the tensors with a third-party lib, nvinfer, or the Triton gRPC sample.

You can dump the inference results generated by model 1 to a file, then test the tensors with a third-party lib, nvinfer, or the Triton gRPC sample.

  1. I dump the output tensor of pgie1 [64, 90, 160] → OK
  2. I use the gRPC Python client for pgie2 → I need to add a batch dimension to make it work (dynamic batching in the Triton model config) → [1, 64, 90, 160] → OK (a sketch of this client call is shown below)
  3. I get the answer from the Triton call to pgie2 [1, 5, 720, 1280] → OK

Then I compare the values given by the output of the complete (non-split) model and the output of pgie2 → OK, same tensor content for the same frame.
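
A minimal sketch of that gRPC client test with the tritonclient Python package; the dumped file name and the server address are assumptions:

# Feed the dumped pgie1 tensor (64, 90, 160) to model_part2 over gRPC, adding
# the batch dimension that Triton expects. File name and address are assumptions.
import numpy as np
import tritonclient.grpc as grpcclient

mid = np.load("model_part1_output.npy")          # (64, 90, 160) FP32 dump of pgie1
mid = np.expand_dims(mid, axis=0)                # (1, 64, 90, 160)

client = grpcclient.InferenceServerClient(url="192.168.121.2:8001")  # Triton on Orin 2 (assumed)
inp = grpcclient.InferInput("/conv3_1/Conv_output_0", list(mid.shape), "FP32")
inp.set_data_from_numpy(mid.astype(np.float32))
out = client.infer("model_part2", [inp]).as_numpy("output")          # (1, 5, 720, 1280)
print(out.shape, out.reshape(-1)[:10])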

Conclusion:

  • The first part of the pipeline is working (from video decoding to the pgie1 output tensor)
  • pgie2 is fine as well
  • I am missing the glue in between (using nvdspreprocess).

How can I debug further ?
Thank you

You can dump the inference results of model 2, then compare them with the output of the gRPC Python client. If the two are the same, the issue should be related to the postprocessing of nvinferserver.

If I do that using pgie_src_pad.add_probe(Gst.PadProbeType.BUFFER, probe_func, 0) on pgie2 and look for pyds.NvDsMetaType.NVDSINFER_TENSOR_OUTPUT_META, I retrieve the tensor of pgie1 (not pgie2).

If I remove

output_control {
  output_tensor_meta: true
}

from pgie2, I still get the output tensor of pgie1 in the probe attached to pgie2.

It seems that the tensor replacement inside the nvdspreprocess plugin is not doing its job…

The nvinferserver plugin and its low-level implementation are open source. You can add logs in InferGrpcClient::InferComplete to dump the inference results, then compare them with the output of the gRPC Python client. Please refer to /opt/nvidia/deepstream/deepstream-7.1/sources/libs/nvdsinferserver/README for how to build the lib.

The nvinferserver plugin and its low-level implementation are open source. You can add logs in InferGrpcClient::InferComplete to dump the inference results, then compare them with the output of the gRPC Python client. Please refer to /opt/nvidia/deepstream/deepstream-7.1/sources/libs/nvdsinferserver/README for how to build the lib.

If I do this, I get the following output:

...
Starting pipeline 

Decodebin child added: nvurisrc_bin_src_elem
INFO: TritonGrpcBackend id:2 initialized for model: model_part2
---ONE CALL---Output Tensor: output
  Shape: [1, 5, 720, 1280]
Tensor sample values:
  [0] = 1.560547
  [1] = 0.659668
  [2] = 3.677734
  [3] = 4.101562
  [4] = 4.101562
  [5] = 3.699219
  [6] = 5.207031
  [7] = 5.710938
  [8] = 5.871094
  [9] = 6.074219
WARNING: unsupported tensor order for dims to image-info, retry as kLinear
frameSeqLen:0
frameSeqLen iilegal, use default vaule 300
CustomTensorPreparation called
INFO: TritonGrpcBackend id:1 initialized for model: model_part1
---ONE CALL---Output Tensor: /conv3_1/Conv_output_0
  Shape: [1, 64, 90, 160]
Tensor sample values:
  [0] = 1.511719
  [1] = 0.609375
  [2] = 0.726562
  [3] = 1.117188
  [4] = 1.066406
  [5] = 1.063477
  [6] = 1.224609
  [7] = 1.257812
  [8] = 1.254883
  [9] = 1.254883
Opening in BLOCKING MODE 
NvMMLiteOpen : Block : BlockType = 261 
NvMMLiteBlockCreate : Block : BlockType = 261 
Decodebin child added: nvurisrc_bin_queue
Decodebin child added: nvurisrc_bin_nvvidconv_elem
Decodebin child added: nvurisrc_bin_src_cap_filter_nvvidconv
In cb_newpad
gstname= video/x-raw
features= <Gst.CapsFeatures object at 0xffff6f7c8580 (GstCapsFeatures at 0xfffeb46c3a80)>
in videoconvert caps = video/x-raw(memory:NVMM), format=(string)RGBA, framerate=(fraction)25/1, width=(int)1280, height=(int)720
nvstreammux: Successfully handled EOS for source_id=0
**PERF: {'stream0': 0.0}
---ONE CALL---Output Tensor: /conv3_1/Conv_output_0
  Shape: [1, 64, 90, 160]
Tensor sample values:
  [0] = 0.615234
  [1] = 0.064453
  [2] = 0.201172
  [3] = 0.070312
  [4] = -0.888672
  [5] = -1.169922
  [6] = 0.101562
  [7] = -0.347656
  [8] = -1.478516
  [9] = -1.234375
CustomTensorPreparation called
Found tensor meta: 1 output layers
Copying layer 0 of size 3686400 bytes
NvMMLiteBlockCreate : Block : BlockType = 1 
End-of-stream
Exiting app

Remarks:

  • The video is only one frame long (therefore only one inference is expected)
  • The first 10 values of the output tensor of model_part2 are not the expected ones
  • I don’t understand why I get the output tensor of model_part2 before model_part1
  • I don’t understand why I get 2 calls of InferGrpcClient::InferComplete for model_part1

nvinferserver does a warmup by default. Please add "disable_warmup: true" to the nvinferserver config. Please refer to /opt/nvidia/deepstream/deepstream-7.1/sources/apps/sample_apps/deepstream-3d-lidar-sensor-fusion/v2xfusion/config/triton_mode_CAPI.txt.