Input tensor is unexpectedly modified before being fed to the primary detector

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: 6.4
• JetPack Version (valid for Jetson only)
• TensorRT Version:
• NVIDIA GPU Driver Version (valid for GPU only): Driver Version: 525.147.05
• Issue Type( questions, new requirements, bugs): bugs
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

  • Why do we raise this issue?
    The primary detector gives different bounding boxes compared to the results obtained by directly calling a Triton Inference Server with the same model.engine file.

The bounding boxes are not entirely incorrect, but rather slightly shifted (and slightly worse), resembling floating-point processing errors.

  • Pipeline setup:

My pipeline:
uridecodebin -> nvstreammux -> queue -> nvinfer (primary detector)

  • Element config:
  1. Input source is a video of size H x W = 640 x 640.
  2. Streammux
      streammux.set_property("width", 640)
      streammux.set_property("height", 640)
      streammux.set_property("batch-size", num_sources)
      streammux.set_property("batched-push-timeout", self.config["batched-push-timeout"])
      streammux.set_property("enable-padding", 0)
      streammux.set_property("interpolation-method", 4)
  3. nvinfer (primary detector)
  gpu-id: 0
  net-scale-factor: 0.0039215697906911373
  offsets: 0;0;0
  model-color-format: 0
  onnx-file: ../models/peoplenet_yolov8x/yolov8x.onnx
  model-engine-file: ../models/peoplenet_yolov8x/yolov8x.onnx_b1_gpu0_fp32.engine
  labelfile-path: ../models/peoplenet_yolov8x/labels.txt
  batch-size: 1
  network-mode: 0
  num-detected-classes: 80
  interval: 0
  gie-unique-id: 1
  filter-out-class-ids: 1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23;24;25;26;27;28;29;30;31;32;33;34;35;36;37;38;39;40;41;42;43;44;45;46;47;48;49;50;51;52;53;54;55;56;57;58;59;60;61;62;63;64;65;66;67;68;69;70;71;72;73;74;75;76;77;78;79
  process-mode: 1
  network-type: 0
  cluster-mode: 2
  maintain-aspect-ratio: 0
  symmetric-padding: 0
  workspace-size: 1000
  parse-bbox-func-name: NvDsInferParseYolo
  custom-lib-path: ../custom_parser/
  output-tensor-meta: 1
  # engine-create-func-name: NvDsInferYoloCudaEngineGet
  crop-objects-to-roi-boundary: 1

  pre-cluster-threshold: 0.21666836936549047
  nms-iou-threshold: 0.5645207000469065
  minBoxes: 2
  dbscan-min-score: 0.693671458753017
  eps: 0.15584185873130887
  detected-min-w: 20
  detected-min-h: 20
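
With the config above, nvinfer's documented preprocessing is y = net-scale-factor * (x - offsets); since net-scale-factor ≈ 1/255 and the offsets are zero, the network input is simply pixel / 255. A numpy-only illustration (not DeepStream code):

```python
import numpy as np

# Illustrates nvinfer's preprocessing with the values from the config above:
# y = net-scale-factor * (x - offsets), where
# net-scale-factor = 0.0039215697906911373 ≈ 1/255 and offsets = 0;0;0.
NET_SCALE_FACTOR = 0.0039215697906911373
OFFSETS = np.array([0.0, 0.0, 0.0], dtype=np.float32)

def preprocess(frame_rgb):
    """Apply y = net-scale-factor * (x - offsets) to an HWC uint8 frame."""
    return NET_SCALE_FACTOR * (frame_rgb.astype(np.float32) - OFFSETS)

white = np.full((640, 640, 3), 255, dtype=np.uint8)
print(round(float(preprocess(white).max()), 4))  # ≈ 1.0
```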
  • Our effort/investigation:
  • We resized the video to 640x640 to match the width and height of streammux.
  • We tried different values of interpolation-method, but the difference still occurred.
  • We ran a simple test: by replacing the nvinfer element with nvinferserver using a Python backend, we were able to receive and dump the input tensors before they are fed to the detector. We saw that the input tensors differ slightly from the original frames extracted from the input video.
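The interpolation point can be illustrated outside of DeepStream: different resampling methods legitimately disagree at the pixel level, which is one plausible source of small tensor differences whenever any internal scaling happens. A numpy-only 1-D sketch (not DeepStream code):

```python
import numpy as np

# Upsample a 1-D row of pixel values with two interpolation methods and
# compare: they legitimately disagree, which is why changing
# interpolation-method shifts pixel values slightly.
row = np.array([0, 100, 200, 50], dtype=np.float32)
src_x = np.arange(len(row))
dst_x = np.linspace(0, len(row) - 1, 8)

linear = np.interp(dst_x, src_x, row)        # bilinear-style resampling
nearest = row[np.round(dst_x).astype(int)]   # nearest-neighbor resampling

print(float(np.abs(linear - nearest).max()) > 0)  # True: the methods differ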

Image 1: Second channel of the 1st frame, logged before being fed to the model in nvinferserver.
Image 2: Second channel of the 1st frame, extracted using opencv-python.

Video for testing: (5.8 MB)

• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

  1. Could you share some screenshots to show the different bboxes?
  2. Please refer to this FAQ for Debug Tips for DeepStream Accuracy Issue.
  3. How did you log the preprocessed data to get those two screenshots?
  1. Let me synthesize them and send them to you later.
  2. Yes, I tried tuning those parameters. I can observe how the parameters work, but I still cannot make the bounding boxes match the results obtained from a standalone Triton Inference Server (with the same engine file).
  • The first image was obtained by writing a Python-backend nvinferserver, in which I added Python code to log the input tensors after multiplying by net-scale-factor and adding the offsets. The pipeline is now uridecodebin -> nvstreammux -> queue -> nvinferserver.
  • The second image was obtained by adding a probe to nvdsosd. The pipeline is now uridecodebin -> nvstreammux -> queue -> nvinfer (pgie) -> nvvideoconvert -> capsfilter -> nvdsosd. The additional elements behind the pgie are just for getting the image.
  • Interestingly, the frames of the input video read by cv2.VideoCapture() and the frames taken from both approaches differ in a pixel-to-pixel comparison.
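
The pixel-to-pixel comparison mentioned above can be quantified with a small helper like the following (a hypothetical sketch; the function name and the assumption that both inputs are same-shape HWC uint8 arrays are mine):

```python
import numpy as np

# Quantify pixel-to-pixel differences between a dumped input tensor and the
# cv2-decoded reference frame (both assumed HWC uint8, identical shape).
def pixel_diff_report(dumped, reference):
    diff = np.abs(dumped.astype(np.int16) - reference.astype(np.int16))
    return {
        "max_abs_diff": int(diff.max()),
        "mean_abs_diff": float(diff.mean()),
        "pct_pixels_changed": float((diff > 0).mean() * 100.0),
    }

a = np.zeros((4, 4, 3), dtype=np.uint8)
b = a.copy()
b[0, 0, 0] = 3  # perturb one value by 3 to simulate a small difference
print(pixel_diff_report(a, b))
```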

Config of the nvinferserver:

infer_config {
  unique_id: 5
  gpu_ids: [0]
  max_batch_size: 1
  backend {
    trt_is {
      model_name: "peoplenet_yolov8x_py"
      version: -1
      model_repo {
        root: "/iva/model_repository_triton"
        log_level: 2
        tf_gpu_memory_fraction: 0.4
        tf_disable_soft_placement: 0
      }
    }
  }
  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_LINEAR
    maintain_aspect_ratio: 0
    normalize {
      scale_factor: 1
      channel_offsets: [0, 0, 0]
    }
  }
  extra {
    copy_input_to_host_buffers: true
  }
  custom_lib {
    path: "/opt/nvidia/deepstream/deepstream/lib/"
  }
}
input_control {
  interval: 0
}
output_control {
  output_tensor_meta: true
}

Python code to get the image by adding a probe to nvdsosd:

gst_buffer = info.get_buffer()
if not gst_buffer:
    logging.error("Unable to get GstBuffer")
    return Gst.PadProbeReturn.OK

batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
l_frame = batch_meta.frame_meta_list
while l_frame is not None:
    try:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
    except StopIteration:
        break
    img = pyds.get_nvds_buf_surface(hash(gst_buffer), frame_meta.batch_id)
    img_copy = np.array(img, copy=True, order='C')
    img_copy = cv2.cvtColor(img_copy, cv2.COLOR_RGBA2BGRA)
    try:
        l_frame = l_frame.next
    except StopIteration:
        break

capsfilter settings:

caps.set_property("caps", Gst.Caps.from_string("video/x-raw(memory:NVMM), format=RGBA"))

  1. About “directly calling a Triton Inference server”, do you mean you are using Python + Triton to do inference without DeepStream?
  2. You are doing inference with a DeepStream pipeline including nvinfer on the one hand, and with Python + Triton without DeepStream on the other, and the DeepStream results are worse. Am I right? Please refer to this yolov8 sample. Let’s focus on nvinfer in this topic if using nvinferserver also gives the worse results.
  3. In theory, if the bboxes are different, we need to compare the preprocessing data, the inference results, and the postprocessing data. Here is the method to dump preprocessing and postprocessing data.
  1. Yes.
  2. Yes, the output bounding boxes from DeepStream (A) are different from the output bounding boxes from the Triton Inference Server (B) mentioned in question 1. Both outputs A and B look reasonable; B seems worse than A.
  3. Thanks, I will try it.

By doing a 2D object detection evaluation on a dataset, I am now OK with the output bounding boxes from the DeepStream pipeline; it gives nearly the same performance as the results obtained from the Triton Inference Server with the same model file. Thanks.
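
For reference, the kind of IoU check underlying such a 2-D detection evaluation can be sketched as follows (the `[x1, y1, x2, y2]` box format and the helper name are my assumptions, not the poster's actual evaluation code):

```python
# Minimal IoU helper of the kind used in 2-D detection evaluation.
# Boxes are [x1, y1, x2, y2] in pixels.
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [0, 0, 10, 10]))  # 1.0
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.1429
```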

Sorry for the late reply. Is this still a DeepStream issue to support? Thanks!

Sorry for the late reply; this is no longer an issue for us, so I will close it. Thanks for the support.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.