PGIE total output not equal to SGIE output

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: 7.0
• TensorRT Version: aligned with the DS-7.0 docker image
• NVIDIA GPU Driver Version (valid for GPU only): 565
• Issue Type (questions, new requirements, bugs): possibly a bug
• How to reproduce the issue?

Context

I want to use nvinferserver’s PGIE + SGIE to build a top-down pose estimation pipeline: the PGIE runs YOLO11 and the SGIE runs a top-down pose estimator.
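For reference, this is roughly the pipeline layout I mean; a minimal Python/GStreamer sketch, assuming a file source, with config file paths as placeholders:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)
# Illustrative element order only: batched decode -> PGIE (YOLO11 detector)
# -> SGIE (top-down pose estimator); URIs and config file names are placeholders.
pipeline = Gst.parse_launch(
    "uridecodebin uri=file:///path/to/video.mp4 ! m.sink_0 "
    "nvstreammux name=m batch-size=1 width=1920 height=1080 ! "
    "nvinferserver config-file-path=pgie_plugin.txt ! "
    "nvinferserver config-file-path=sgie_plugin.txt ! "
    "fakesink"
)
pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()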

Details

  1. I attach object meta to each frame_meta and also update the num_obj_meta member.
  2. My expectation is that the SGIE input/output batch size equals the sum of num_obj_meta over the frames the PGIE passes on; e.g., with 2 frames where num_obj_meta=3 and num_obj_meta=4, the SGIE’s nvds_frame_meta_list length should be 7, but that is not what I observe in my test.
  3. I use the batch meta lock and bInferDone to manage synchronization; I am not sure whether this is enough to avoid race conditions.

Assumption 2 is important for me because I need to match the SGIE results back to each frame (see the probe sketch below).
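To make assumption 2 concrete, here is a minimal pad-probe sketch (assuming a Python app with pyds; the probe name and where it is attached are illustrative) that sums num_obj_meta over the batch, which is the count I expect the SGIE to infer on:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds

def sgie_sink_pad_probe(pad, info, user_data):
    # Sum num_obj_meta over all frames in the batched buffer; per assumption 2
    # this should equal the number of object crops the SGIE processes.
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(info.get_buffer()))
    total = 0
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        total += frame_meta.num_obj_meta
        l_frame = l_frame.next
    print("expected SGIE object count for this batch:", total)
    return Gst.PadProbeReturn.OK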

Other Questions

There is still something that confuses me: nvdsinferserver attaches all the meta data to the PGIE frame_meta, but when I debug inside the SGIE’s inferenceDone(), I notice the SGIE does not do this at all. Does the meta data in the SGIE actually work under nvinferserver’s hood, especially bInferDone?


• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

PGIE config

config.pbtxt

name: "YOLO11-Det"
platform: "tensorrt_plan"
default_model_filename: "end2end.engine"
max_batch_size: 0

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1, 3, 640, 640 ]
  }
]

output [
  {
    name: "dets"
    data_type: TYPE_FP32
    dims: [ 128, 7 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

version_policy {
  specific: { versions: [1]}
}

config for nvinferserver

config file

input_control {
  async_mode: false
  process_mode: PROCESS_MODE_FULL_FRAME
  operate_on_gie_id: -1
  interval: 0
}

infer_config {
  gpu_ids: [0]
  backend {
    triton {
      model_name: "YOLO11-Det"
      version: -1
      model_repo {
        root: "xxx"
        backend_configs: [
          {
            backend: "tensorrt_plan"
          }
        ]
        strict_model_config: false
        min_compute_capacity: 8.0
        log_level: 2
      }
    }
    inputs [
      {
        name: "input"
        dims: [ 3, 640, 640 ]
        data_type: TENSOR_DT_FP32
      }
    ]
    outputs [
      {
        name: "dets"
        max_buffer_bytes: 4096
      }
    ]
    output_mem_type: MEMORY_TYPE_CPU
  }
  preprocess {
    network_format: IMAGE_FORMAT_BGR
    tensor_order: TENSOR_ORDER_LINEAR
    maintain_aspect_ratio: 1
    frame_scaling_filter: 1
    symmetric_padding: 0
    normalize {
      scale_factor: 0.003921569
      channel_offsets: [0, 0, 0]
    }
  }
  postprocess {
    other {}
  }
  extra {
    custom_process_funcion: "YOLO11Det"
    output_buffer_pool_size: 128
  }
  custom_lib {
    path: "xxx.so"
  }
}

config for plugin

unique-id: 1
process-mode: 1
input-tensor-meta: 0
config-file-path: xxx.txt

SGIE config

config.pbtxt

name: "RTMPose-m"
platform: "tensorrt_plan"

default_model_filename: "end2end.engine"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 384, 288 ]
  }
]

output [
  {
    name: "simcc_x"
    data_type: TYPE_FP32
    dims: [ 26, -1 ]
  },
  {
    name: "simcc_y"
    data_type: TYPE_FP32
    dims: [ 26, -1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

version_policy {
  specific: { versions: [1]}
}

dynamic_batching {
  preferred_batch_size: [ 32 ]
  max_queue_delay_microseconds: 1500
}

nvinferserver config

config file

input_control {
  async_mode: false
  operate_on_gie_id: 1
  process_mode: PROCESS_MODE_CLIP_OBJECTS
  secondary_reinfer_interval: 0
}

infer_config {
  gpu_ids: [0]
  backend {
    triton {
      model_name: "RTMPose-m"
      version: -1
      model_repo {
        root: "xxx"
        backend_configs: [
          {
            backend: "tensorrt_plan"
          }
        ]
        strict_model_config: false
        log_level: 2
      }
    }
    inputs [
      {
        name: "input"
        dims: [3, 384, 288]
        data_type: TENSOR_DT_FP32
      }
    ]
    outputs [
      {
        name: "simcc_x"
      },
      {
        name: "simcc_y"
      }
    ]
    output_mem_type: MEMORY_TYPE_CPU
  }
  preprocess {
    network_format: IMAGE_FORMAT_BGR
    tensor_order: TENSOR_ORDER_LINEAR
    maintain_aspect_ratio: 1
    symmetric_padding: 0
    normalize {
      scale_factor: 0.017352074
      channel_offsets: [ 123.675, 116.28, 103.53 ]
    }
  }
  postprocess {
    other {}
  }
  extra {
    custom_process_funcion: "RTMPose"
  }
  custom_lib {
    path: "xxx.so"
  }
}

output_control {
  output_tensor_meta: true
}

config for plugin

unique-id: 2
process-mode: 2
input-tensor-meta: 0
infer-on-gie-id: 1
config-file-path: xxx.txt

Update: I have now confirmed that nvinferserver sets max_batch_size to 1 when the user provides max_batch_size <= 0, so in my case the result is wrong.

But as the Triton Server documentation says, it is valid to set max_batch_size=0, especially for a dynamic input shape with a static output shape, so what can I do to apply this correctly?

Since you are using “tensorrt_plan”, how did you generate the engine file? You said your model has a “dynamic input shape with static output shape”.

Here is the network info:

For example, I am using YOLO with an NMS layer, which is already supported in both ONNX and TensorRT (INMSLayer). The original output in both cases is “runtime” dynamic with shape [M, 3], where M is the number of valid bboxes computed by the NMS op.

Then I found some sources saying TensorRT would require extra callbacks to obtain that M in order to allocate the output buffer, which means I would need to implement an allocator. So my workaround is to construct a static zero tensor in Python code (e.g. with shape [128, 7], where 7 means [batch_id, class_id, cx, cy, w, h, confidence]) and assign the valid results into that static tensor.
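A minimal sketch of that padding idea, assuming the NMS output arrives as an [M, 7] numpy array (the function name is illustrative):

import numpy as np

def pad_detections(dets: np.ndarray, max_det: int = 128) -> np.ndarray:
    """Pack M runtime-dynamic NMS detections into a static [max_det, 7] tensor.

    Columns: [batch_id, class_id, cx, cy, w, h, confidence]; unused rows stay zero.
    """
    out = np.zeros((max_det, dets.shape[1]), dtype=np.float32)
    m = min(dets.shape[0], max_det)
    out[:m] = dets[:m]
    return out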

In my test, this works fine with the Python API (both onnxruntime-gpu and TensorRT).

I also read the Triton Server documentation; this situation is fully supported, and I confirmed it with both the Python and C++ APIs.

So as far as I know, this whole problem is caused by TrtisServer in DeepStream, because it uses the batch-size parameter from the Triton Server config as the reference for building input buffers from the GstBuffer and running inference. In that design 0 is not supported, while in Triton Server’s design max_batch_size=0 is valid.

In conclusion, I can confirm one possible solution: use a third-party batched version of the NMS op to align the dynamic inputs/outputs, but I don’t think this is an ideal approach. I also want to know whether there is any way to modify the output of the current NMS layer into a shape like [N, M, 5] (M = max_bbox_number), or any better way? (My main goal is to avoid writing an NMS op on my own.)

Your original model is an ONNX model. What tool and command did you use to generate the TensorRT engine (plan) file?

I use this code to generate the .engine file:

my code:

    # assuming mmdeploy's TensorRT helper (see the mmdeploy link below)
    import tensorrt as trt
    from mmdeploy.backend.tensorrt.utils import from_onnx

    # build an FP16 engine with a dynamic batch dimension (1..8) for 'input'
    engine_name = 'yolo11s-3060'
    from_onnx('yolo11s.onnx',
              engine_name,
              fp16_mode=True,
              input_shapes={
                  'input': {
                      "min_shape": [1, 3, 640, 640],
                      "opt_shape": [4, 3, 640, 640],
                      "max_shape": [8, 3, 640, 640]
                  }
              },
              log_level=trt.Logger.INFO,
              max_workspace_size=16 << 30)

Such code is for explicit batch: mmdeploy/mmdeploy/backend/tensorrt/utils.py at main · open-mmlab/mmdeploy. Why don’t you use the “trtexec” tool in the TensorRT package to generate the TensorRT engine?

Please consult the TensorRT forum for how to write TensorRT Python code to convert the dynamic ONNX model to a TensorRT engine: Latest Deep Learning (Training & Inference)/TensorRT topics - NVIDIA Developer Forums

Or use the “trtexec” tool.

OK, thanks, my problem is basically solved now.

But regarding what you posted, I don’t think this would make any difference, because the enum you mentioned now only supports EXPLICIT_BATCH (as of TensorRT 8.6.1); from nvinfer.h:

enum class NetworkDefinitionCreationFlag : int32_t
{
    //! Mark the network to be an explicit batch network.
    //! Dynamic shape support requires that the kEXPLICIT_BATCH flag is set.
    //! With dynamic shapes, any of the input dimensions can vary at run-time,
    //! and there are no implicit dimensions in the network specification.
    //! Varying dimensions are specified by using the wildcard dimension value -1.
    kEXPLICIT_BATCH = 0,

    //! Deprecated. This flag has no effect now, but is only kept for backward compatability.
    //!
    kEXPLICIT_PRECISION TRT_DEPRECATED_ENUM = 1,
};

I want to know whether you are mistaken on this one.

My trtexec command:

trtexec --onnx=yolo11s.onnx --fp16 --minShapes=input:1x3x640x640 --optShapes=input:4x3x640x640 --maxShapes=input:8x3x640x640 --device=0 --iterations=200 --warmUp=10 --saveEngine=trtexec_yolo11.engine
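To double-check what the generated engine actually contains (a dynamic batch range vs. a fixed batch of 8), a quick inspection sketch with the TensorRT Python API, assuming the engine file name from the command above:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
with open("trtexec_yolo11.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name))
    if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
        # (min, opt, max) shapes of optimization profile 0
        print("  profile 0:", engine.get_tensor_profile_shape(name, 0))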

trtexec is open source; please refer to TensorRT/samples/trtexec at release/8.6 · NVIDIA/TensorRT for how to parse an ONNX model and generate an engine file.

This is not aligned with your engine file. You generated a batch-size-8 engine file while configuring it as dynamic batch.
