PGIE total output not equal to SGIE output

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: 7.0
• TensorRT Version: aligned with the DS-7.0 docker image
• NVIDIA GPU Driver Version (valid for GPU only): 565
• Issue Type (questions, new requirements, bugs): possibly a bug
• How to reproduce the issue?

Context

I want to use nvinferserver’s PGIE + SGIE to build a top-down pose estimation pipeline: the PGIE runs YOLO11 and the SGIE runs a top-down pose estimator.
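For reference, this is roughly the pipeline layout I mean; a minimal Python/GStreamer sketch, assuming a file source, with config file paths as placeholders:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)
# Illustrative element order only: batched decode -> PGIE (YOLO11 detector)
# -> SGIE (top-down pose estimator); URIs and config file names are placeholders.
pipeline = Gst.parse_launch(
    "uridecodebin uri=file:///path/to/video.mp4 ! m.sink_0 "
    "nvstreammux name=m batch-size=1 width=1920 height=1080 ! "
    "nvinferserver config-file-path=pgie_plugin.txt ! "
    "nvinferserver config-file-path=sgie_plugin.txt ! "
    "fakesink"
)
pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()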

Details

  1. I attach object meta to each frame_meta and also update the num_obj_meta member.
  2. My expectation is that the SGIE input/output batch size equals the sum of num_obj_meta over the frames the PGIE passes on; e.g., with 2 frames where num_obj_meta=3 and num_obj_meta=4, the SGIE’s nvds_frame_meta_list length should be 7, but that is not what I observe in my test.
  3. I use the batch meta lock and bInferDone to manage synchronization; I am not sure whether this is enough to avoid race conditions.

Assumption 2 is important for me because I need to match the SGIE results back to each frame (see the probe sketch below).
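To make assumption 2 concrete, here is a minimal pad-probe sketch (assuming a Python app with pyds; the probe name and where it is attached are illustrative) that sums num_obj_meta over the batch, which is the count I expect the SGIE to infer on:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds

def sgie_sink_pad_probe(pad, info, user_data):
    # Sum num_obj_meta over all frames in the batched buffer; per assumption 2
    # this should equal the number of object crops the SGIE processes.
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(info.get_buffer()))
    total = 0
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        total += frame_meta.num_obj_meta
        l_frame = l_frame.next
    print("expected SGIE object count for this batch:", total)
    return Gst.PadProbeReturn.OK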

Other Questions

There is still something that confuses me: nvdsinferserver attaches all the meta data to the PGIE frame_meta, but when I debug inside the SGIE’s inferenceDone(), I notice the SGIE does not do this at all. Does the meta data in the SGIE actually work under nvinferserver’s hood, especially bInferDone?


• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

PGIE config

config.pbtxt

name: "YOLO11-Det"
platform: "tensorrt_plan"
default_model_filename: "end2end.engine"
max_batch_size: 0

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1, 3, 640, 640 ]
  }
]

output [
  {
    name: "dets"
    data_type: TYPE_FP32
    dims: [ 128, 7 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

version_policy {
  specific: { versions: [1]}
}

config for nvinferserver

config file

input_control {
  async_mode: false
  process_mode: PROCESS_MODE_FULL_FRAME
  operate_on_gie_id: -1
  interval: 0
}

infer_config {
  gpu_ids: [0]
  backend {
    triton {
      model_name: "YOLO11-Det"
      version: -1
      model_repo {
        root: "xxx"
        backend_configs: [
          {
            backend: "tensorrt_plan"
          }
        ]
        strict_model_config: false
        min_compute_capacity: 8.0
        log_level: 2
      }
    }
    inputs [
      {
        name: "input"
        dims: [ 3, 640, 640 ]
        data_type: TENSOR_DT_FP32
      }
    ]
    outputs [
      {
        name: "dets"
        max_buffer_bytes: 4096
      }
    ]
    output_mem_type: MEMORY_TYPE_CPU
  }
  preprocess {
    network_format: IMAGE_FORMAT_BGR
    tensor_order: TENSOR_ORDER_LINEAR
    maintain_aspect_ratio: 1
    frame_scaling_filter: 1
    symmetric_padding: 0
    normalize {
      scale_factor: 0.003921569
      channel_offsets: [0, 0, 0]
    }
  }
  postprocess {
    other {}
  }
  extra {
    custom_process_funcion: "YOLO11Det"
    output_buffer_pool_size: 128
  }
  custom_lib {
    path: "xxx.so"
  }
}

config for plugin

unique-id: 1
process-mode: 1
input-tensor-meta: 0
config-file-path: xxx.txt

SGIE config

config.pbtxt

name: "RTMPose-m"
platform: "tensorrt_plan"

default_model_filename: "end2end.engine"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 384, 288 ]
  }
]

output [
  {
    name: "simcc_x"
    data_type: TYPE_FP32
    dims: [ 26, -1 ]
  },
  {
    name: "simcc_y"
    data_type: TYPE_FP32
    dims: [ 26, -1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

version_policy {
  specific: { versions: [1]}
}

dynamic_batching {
  preferred_batch_size: [ 32 ]
  max_queue_delay_microseconds: 1500
}

nvinferserver config

config file

input_control {
  async_mode: false
  operate_on_gie_id: 1
  process_mode: PROCESS_MODE_CLIP_OBJECTS
  secondary_reinfer_interval: 0
}

infer_config {
  gpu_ids: [0]
  backend {
    triton {
      model_name: "RTMPose-m"
      version: -1
      model_repo {
        root: "xxx"
        backend_configs: [
          {
            backend: "tensorrt_plan"
          }
        ]
        strict_model_config: false
        log_level: 2
      }
    }
    inputs [
      {
        name: "input"
        dims: [3, 384, 288]
        data_type: TENSOR_DT_FP32
      }
    ]
    outputs [
      {
        name: "simcc_x"
      },
      {
        name: "simcc_y"
      }
    ]
    output_mem_type: MEMORY_TYPE_CPU
  }
  preprocess {
    network_format: IMAGE_FORMAT_BGR
    tensor_order: TENSOR_ORDER_LINEAR
    maintain_aspect_ratio: 1
    symmetric_padding: 0
    normalize {
      scale_factor: 0.017352074
      channel_offsets: [ 123.675, 116.28, 103.53 ]
    }
  }
  postprocess {
    other {}
  }
  extra {
    custom_process_funcion: "RTMPose"
  }
  custom_lib {
    path: "xxx.so"
  }
}

output_control {
  output_tensor_meta: true
}

config for plugin

unique-id: 2
process-mode: 2
input-tensor-meta: 0
infer-on-gie-id: 1
config-file-path: xxx.txt

Update: I have now confirmed that nvinferserver sets max_batch_size to 1 when the user provides max_batch_size <= 0, so in my case the result is wrong.

But as the Triton Server documentation says, it is valid to set max_batch_size=0, especially for a dynamic input shape with a static output shape, so what can I do to apply this correctly?

Since you are using “tensorrt_plan”, how did you generate the engine file? You said your model has a “dynamic input shape with static output shape”.

Here is the network info:

For example, I am using YOLO with an NMS layer, which is already supported in both ONNX and TensorRT (INMSLayer). The original output in both cases is “runtime” dynamic with shape [M, 3], where M is the number of valid bboxes computed by the NMS op.

Then I found some sources saying TensorRT would require extra callbacks to obtain that M in order to allocate the output buffer, which means I would need to implement an allocator. So my workaround is to construct a static zero tensor in Python code (e.g. with shape [128, 7], where 7 means [batch_id, class_id, cx, cy, w, h, confidence]) and assign the valid results into that static tensor.
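A minimal sketch of that padding idea, assuming the NMS output arrives as an [M, 7] numpy array (the function name is illustrative):

import numpy as np

def pad_detections(dets: np.ndarray, max_det: int = 128) -> np.ndarray:
    """Pack M runtime-dynamic NMS detections into a static [max_det, 7] tensor.

    Columns: [batch_id, class_id, cx, cy, w, h, confidence]; unused rows stay zero.
    """
    out = np.zeros((max_det, dets.shape[1]), dtype=np.float32)
    m = min(dets.shape[0], max_det)
    out[:m] = dets[:m]
    return out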

In my test, this works fine with the Python API (both onnxruntime-gpu and TensorRT).

I also read the Triton Server documentation; this situation is fully supported, and I confirmed it with both the Python and C++ APIs.

So as far as I know, this whole problem is caused by TrtisServer in DeepStream, because it uses the batch-size parameter from the Triton Server config as the reference for building input buffers from the GstBuffer and running inference. In that design 0 is not supported, while in Triton Server’s design max_batch_size=0 is valid.

In conclusion, I can confirm one possible solution: use a third-party batched version of the NMS op to align the dynamic inputs/outputs, but I don’t think this is an ideal approach. I also want to know whether there is any way to modify the output of the current NMS layer into a shape like [N, M, 5] (M = max_bbox_number), or any better way? (My main goal is to avoid writing an NMS op on my own.)

Your original model is an ONNX model. What tool and command did you use to generate the TensorRT engine (plan) file?

I use this code to generate the .engine file:

my code:

    # assuming mmdeploy's TensorRT helper (see the mmdeploy link below)
    import tensorrt as trt
    from mmdeploy.backend.tensorrt.utils import from_onnx

    # build an FP16 engine with a dynamic batch dimension (1..8) for 'input'
    engine_name = 'yolo11s-3060'
    from_onnx('yolo11s.onnx',
              engine_name,
              fp16_mode=True,
              input_shapes={
                  'input': {
                      "min_shape": [1, 3, 640, 640],
                      "opt_shape": [4, 3, 640, 640],
                      "max_shape": [8, 3, 640, 640]
                  }
              },
              log_level=trt.Logger.INFO,
              max_workspace_size=16 << 30)

Such code is for explicit batch: mmdeploy/mmdeploy/backend/tensorrt/utils.py at main · open-mmlab/mmdeploy. Why don’t you use the “trtexec” tool in the TensorRT package to generate the TensorRT engine?

Please consult the TensorRT forum for how to write TensorRT Python code to convert the dynamic ONNX model to a TensorRT engine: Latest Deep Learning (Training & Inference)/TensorRT topics - NVIDIA Developer Forums

Or use the “trtexec” tool.

OK, thanks, my problem is basically solved now.

But regarding what you posted, I don’t think this would make any difference, because the enum you mentioned now only supports EXPLICIT_BATCH (as of TensorRT 8.6.1); from nvinfer.h:

enum class NetworkDefinitionCreationFlag : int32_t
{
    //! Mark the network to be an explicit batch network.
    //! Dynamic shape support requires that the kEXPLICIT_BATCH flag is set.
    //! With dynamic shapes, any of the input dimensions can vary at run-time,
    //! and there are no implicit dimensions in the network specification.
    //! Varying dimensions are specified by using the wildcard dimension value -1.
    kEXPLICIT_BATCH = 0,

    //! Deprecated. This flag has no effect now, but is only kept for backward compatability.
    //!
    kEXPLICIT_PRECISION TRT_DEPRECATED_ENUM = 1,
};

I want to know whether you are mistaken on this one.

My trtexec command:

trtexec --onnx=yolo11s.onnx --fp16 --minShapes=input:1x3x640x640 --optShapes=input:4x3x640x640 --maxShapes=input:8x3x640x640 --device=0 --iterations=200 --warmUp=10 --saveEngine=trtexec_yolo11.engine
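To double-check what the generated engine actually contains (a dynamic batch range vs. a fixed batch of 8), a quick inspection sketch with the TensorRT Python API, assuming the engine file name from the command above:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
with open("trtexec_yolo11.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name))
    if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
        # (min, opt, max) shapes of optimization profile 0
        print("  profile 0:", engine.get_tensor_profile_shape(name, 0))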

trtexec is open source; please refer to TensorRT/samples/trtexec at release/8.6 · NVIDIA/TensorRT for how to parse an ONNX model and generate an engine file.

This is not aligned with your engine file. You generated a batch-size-8 engine file while configuring it as dynamic batch.
