Outputs of Secondary GIE are not consistent across multiple runs of the same video in Parallel Inference

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
GPU (A5000)
• DeepStream Version
Deepstream 6.2
• JetPack Version (valid for Jetson only)
• TensorRT Version
TensorRT 8.5.2
• NVIDIA GPU Driver Version (valid for GPU only)
Version 525.85
• Issue Type( questions, new requirements, bugs)
Potential Bug
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

The issue can be recreated using the sample provided at GitHub - NVIDIA-AI-IOT/deepstream_parallel_inference_app. All the models and videos come from the DeepStream samples, so it can be recreated without additional data.

[Config File]

# SPDX-FileCopyrightText: Copyright (c) <2022> NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: MIT
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
# The values in the config file are overridden by values set through GObject
# properties.

application:
  enable-perf-measurement: 1
  perf-measurement-interval-sec: 5

tiled-display:
  enable: 0
  rows: 2
  columns: 2
  width: 1280
  height: 720
  gpu-id: 0
  #(0): nvbuf-mem-default - Default memory allocated, specific to particular platform
  #(1): nvbuf-mem-cuda-pinned - Allocate Pinned/Host cuda memory, applicable for Tesla
  #(2): nvbuf-mem-cuda-device - Allocate Device cuda memory, applicable for Tesla
  #(3): nvbuf-mem-cuda-unified - Allocate Unified cuda memory, applicable for Tesla
  #(4): nvbuf-mem-surface-array - Allocate Surface Array memory, applicable for Jetson
  nvbuf-memory-type: 0

source:
  csv-file-path: sources_2_different_sources.csv
  #csv-file-path: sources_4_different_source.csv
  #csv-file-path: sources_4_rtsp.csv

sink0:
  enable: 1
  #Type - 1=FakeSink 2=EglSink 3=File 7=nv3dsink (Jetson only)
  type: 1
  sync: 1
  source-id: 0
  gpu-id: 0
  nvbuf-memory-type: 3

osd:
  enable: 1
  gpu-id: 0
  border-width: 1
  text-size: 15
  #value changed
  text-color: 1;1;1;1
  text-bg-color: 0.3;0.3;0.3;1
  font: Serif
  show-clock: 0
  clock-x-offset: 800
  clock-y-offset: 820
  clock-text-size: 12
  clock-color: 1;0;0;0
  nvbuf-memory-type: 3

streammux:
  gpu-id: 0
  ##Boolean property to inform muxer that sources are live
  live-source: 0
  buffer-pool-size: 4
  batch-size: 2
  ##time out in usec, to wait after the first buffer is available
  ##to push the batch even if the complete batch is not formed
  batched-push-timeout: 40000
  ## Set muxer output width and height
  width: 1920
  height: 1080
  ##Enable to maintain aspect ratio wrt source, and allow black borders, works
  ##along with width, height properties
  enable-padding: 0
  nvbuf-memory-type: 3

primary-gie0:
  enable: 1
  #(0): nvinfer; (1): nvinferserver
  plugin-type: 0
  gpu-id: 0
  #input-tensor-meta: 1
  batch-size: 4
  #Required by the app for OSD, not a plugin property
  bbox-border-color0: 1;0;0;1
  bbox-border-color1: 0;1;1;1
  bbox-border-color2: 0;0;1;1
  bbox-border-color3: 0;1;0;1
  #interval: 0
  gie-unique-id: 1
  nvbuf-memory-type: 3
  #config-file: ../../yolov4/config_yolov4_inferserver.txt
  config-file: ../yolov4/config_yolov4_infer.txt

branch0:
  ## pgie's id
  pgie-id: 1
  ## select sources by sourceid
  src-ids: 0;1 
  
secondary-gie0:
  enable: 1
  ##support multiple sgie.
  cfg-file-path: secondary-gie0.yml

primary-gie1:
  enable: 1
  #(0): nvinfer; (1): nvinferserver
  plugin-type: 0
  gpu-id: 0
  #input-tensor-meta: 1
  batch-size: 4
  #Required by the app for OSD, not a plugin property
  bbox-border-color0: 1;0;0;1
  bbox-border-color1: 0;1;1;1
  bbox-border-color2: 0;0;1;1
  bbox-border-color3: 0;1;0;1
  #interval: 0
  gie-unique-id: 2
  nvbuf-memory-type: 3
  #config-file: ../../yolov4/config_yolov4_inferserver.txt
  config-file: ../yolov4/config_yolov4_infer.txt

branch1:
  ## pgie's id
  pgie-id: 2
  ## select sources by sourceid
  src-ids: 0;1

secondary-gie1:
  enable: 1
  ##support multiple sgie
  cfg-file-path: ./secondary-gie1.yml

meta-mux:
  enable: 1
  #config-file: ../../metamux/config_metamux0.txt
  config-file: ./config_metamux0.txt

tests:
  file-loop: 0

[sources_2_different_sources.csv]
Both of the sample videos are available with DeepStream.

enable,type,uri,num-sources,gpu-id,cudadec-memtype
1,3,file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4,1,0,2
1,3,file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_qHD.mp4,1,0,0

[secondary-gie0.yml]
For demonstration purposes, the config_infer_secondary_vehicletypes.txt model is applied to class-id 0 (Person).

secondary-gie0:
  enable: 1
  ##(0): nvinfer; (1): nvinferserver
  plugin-type: 0
  ## nvinferserver's gpu-id can only be set from its own config-file
  #gpu-id=0
  batch-size: 16
  gie-unique-id: 11
  operate-on-gie-id: 1
  operate-on-class-ids: 2
  config-file: config_infer_secondary_carcolor.txt

secondary-gie1:
  enable: 1
  ##(0): nvinfer; (1): nvinferserver
  plugin-type: 0
  ## nvinferserver's gpu-id can only be set from its own config-file
  #gpu-id=0
  batch-size: 16
  gie-unique-id: 12
  operate-on-gie-id: 1
  operate-on-class-ids: 0
  config-file: config_infer_secondary_vehicletypes.txt

We updated the body_pose_gie_src_pad_buffer_probe function to record the PGIE and SGIE inference results as follows:

static GstPadProbeReturn
body_pose_gie_src_pad_buffer_probe(GstPad *pad, GstPadProbeInfo *info,
                          gpointer u_data)
{
  gchar *msg = NULL;
  GstBuffer *buf = (GstBuffer *)info->data;
  NvDsMetaList *l_frame = NULL;
  NvDsMetaList *l_obj = NULL;
  NvDsMetaList *l_user = NULL;
  NvDsMetaList *l_cls = NULL;
  NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta(buf);

  for (l_frame = batch_meta->frame_meta_list; l_frame != NULL;
       l_frame = l_frame->next)
  {
    NvDsFrameMeta *frame_meta = (NvDsFrameMeta *)(l_frame->data);

    if (frame_meta->batch_id == 0)
      g_print("Processing frame number = %d\t\n", frame_meta->frame_num);

    // for (l_user = frame_meta->frame_user_meta_list; l_user != NULL;
    //      l_user = l_user->next)
    // {
    //   NvDsUserMeta *user_meta = (NvDsUserMeta *)l_user->data;
    //   if (user_meta->base_meta.meta_type == NVDSINFER_TENSOR_OUTPUT_META)
    //   {
    //     NvDsInferTensorMeta *tensor_meta =
    //         (NvDsInferTensorMeta *)user_meta->user_meta_data;
    //     Vec2D<int> objects;
    //     Vec3D<float> normalized_peaks;
    //     tie(objects, normalized_peaks) = parse_objects_from_tensor_meta(tensor_meta);
    //     create_display_meta(objects, normalized_peaks, frame_meta, frame_meta->source_frame_width, frame_meta->source_frame_height);
    //   }
    // }

    for (l_obj = frame_meta->obj_meta_list; l_obj != NULL;
         l_obj = l_obj->next)
    {
      NvDsObjectMeta *obj_meta = (NvDsObjectMeta *)l_obj->data;
      // for (l_user = obj_meta->obj_user_meta_list; l_user != NULL;
      //      l_user = l_user->next)
      // {
      //   NvDsUserMeta *user_meta = (NvDsUserMeta *)l_user->data;
      //   if (user_meta->base_meta.meta_type == NVDSINFER_TENSOR_OUTPUT_META)
      //   {
      //     NvDsInferTensorMeta *tensor_meta =
      //         (NvDsInferTensorMeta *)user_meta->user_meta_data;
      //     Vec2D<int> objects;
      //     Vec3D<float> normalized_peaks;
      //     tie(objects, normalized_peaks) = parse_objects_from_tensor_meta(tensor_meta);
      //     create_display_meta(objects, normalized_peaks, frame_meta, frame_meta->source_frame_width, frame_meta->source_frame_height);
      //   }
      // }

      // Recording Inference Results
      float left = obj_meta->detector_bbox_info.org_bbox_coords.left;
      float top = obj_meta->detector_bbox_info.org_bbox_coords.top;
      float right = left + obj_meta->detector_bbox_info.org_bbox_coords.width;
      float bottom = top + obj_meta->detector_bbox_info.org_bbox_coords.height;
      float confidence = obj_meta->confidence;

      char outname[256];
      sprintf(outname, "./outputs/tmp_src_%d.txt", frame_meta->source_id);      
      FILE* fp = fopen(outname, "a");
      if(fp)
      {
        fprintf (fp, "[%d] cls_id %d comp_id %d :: %s :: %f %f %f %f :: conf %f\n",
            frame_meta->frame_num, obj_meta->class_id, obj_meta->unique_component_id,
            obj_meta->obj_label, left, top, right, bottom, confidence);
        fclose(fp);
      }

      #if 1
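      // Write out the SGIE classifier labels (NvDsClassifierMeta -> NvDsLabelInfo) attached to this object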
      for (l_cls = obj_meta->classifier_meta_list; l_cls != NULL; l_cls = l_cls->next)
      {
        NvDsClassifierMeta *cls_meta = (NvDsClassifierMeta *)l_cls->data;

        NvDsLabelInfoList* l_label;
        for (l_label = cls_meta->label_info_list; l_label != NULL; l_label = l_label->next)
        {
          NvDsLabelInfo *label_meta = (NvDsLabelInfo*) l_label->data;
          fp = fopen(outname, "a");
          if(fp)
          {
            fprintf(fp, "%d %s %d %f\n",
                        cls_meta->unique_component_id,
                        label_meta->result_label, label_meta->result_class_id, label_meta->result_prob);
            fclose(fp);
          }
        }
      }
      #endif

      #if 1
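      // Write out the raw SGIE tensor output (requires output-tensor-meta=1); here the first six values of the predictions/Softmax layer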
      for (l_user = obj_meta->obj_user_meta_list; l_user != NULL;
           l_user = l_user->next)
      {
        NvDsUserMeta *user_meta = (NvDsUserMeta *)l_user->data;
        if (user_meta->base_meta.meta_type == NVDSINFER_TENSOR_OUTPUT_META)
        {
          NvDsInferTensorMeta *tensor_meta =
              (NvDsInferTensorMeta *)user_meta->user_meta_data;
          /** Holds the TensorRT binding index of the layer. */
          int bindingIndex = tensor_meta->output_layers_info->bindingIndex;              
          const char *layerName = tensor_meta->output_layers_info->layerName;
          void* map_data = tensor_meta->out_buf_ptrs_host[0];
          if(strcmp(layerName,"predictions/Softmax")==0)
          {
            float* data = (float *)map_data;
            fp = fopen(outname, "a");
            if(fp)
            {
              fprintf(fp, "[%d] :: %s %d  %f %f %f %f %f %f\n",
                          frame_meta->frame_num,
                          layerName, bindingIndex,
                          data[0], data[1], data[2], data[3], data[4], data[5]);
              fclose(fp);
            }
          }
        }
      }
      #endif
      
      // Writing the same user_meta data to a file 10 times produces identical results after every run
      #if 0 
      for(int i=0; i<10; i++)
      {
        #if 1
        for (l_user = obj_meta->obj_user_meta_list; l_user != NULL;
            l_user = l_user->next)
        {
          NvDsUserMeta *user_meta = (NvDsUserMeta *)l_user->data;
          if (user_meta->base_meta.meta_type == NVDSINFER_TENSOR_OUTPUT_META)
          {
            NvDsInferTensorMeta *tensor_meta =
                (NvDsInferTensorMeta *)user_meta->user_meta_data;
            /** Holds the TensorRT binding index of the layer. */
            int bindingIndex = tensor_meta->output_layers_info->bindingIndex;              
            const char *layerName = tensor_meta->output_layers_info->layerName;
            void* map_data = tensor_meta->out_buf_ptrs_host[0];
            if(strcmp(layerName,"predictions/Softmax")==0)
            {
              float* data = (float *)map_data;
              fp = fopen(outname, "a");
              if(fp)
              {
                fprintf(fp, "[%d] :: %s %d  %f %f %f %f %f %f\n",
                            frame_meta->frame_num,
                            layerName, bindingIndex,
                            data[0], data[1], data[2], data[3], data[4], data[5]);
                fclose(fp);
              }
            }
          }
        }
        #endif        
      }
      #endif  
    }
  }
  return GST_PAD_PROBE_OK;
}
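
For context, a probe like this is normally attached to an element's src pad with gst_pad_add_probe(). The snippet below is only a sketch of how that attachment typically looks; the element variable (downstream_elem) is a placeholder, and the actual attachment point in the parallel inference app may differ.

/* Hedged sketch (not the app's actual code): attach the buffer probe above to
 * the src pad of an element downstream of the SGIEs. "downstream_elem" is a
 * placeholder GstElement*; the real attachment point in the app may differ. */
GstPad *probe_pad = gst_element_get_static_pad (downstream_elem, "src");
if (probe_pad) {
  gst_pad_add_probe (probe_pad, GST_PAD_PROBE_TYPE_BUFFER,
      body_pose_gie_src_pad_buffer_probe, NULL, NULL);
  gst_object_unref (probe_pad);
}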

• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

We are using the parallel inference sample provided at GitHub - NVIDIA-AI-IOT/deepstream_parallel_inference_app to create a pipeline with two detector+classifier branches combined using metamux. The most basic pipeline diagram can be seen below:

The issue we are facing is that the tensor outputs produced by the Secondary GIEs are inconsistent across multiple runs of the same input videos. We write the inference results to a file by accessing batch_meta and user_meta inside the body_pose_gie_src_pad_buffer_probe probe and compare the output across multiple runs. The outputs tend to be identical for a number of consecutive frames, followed by a number of consecutive frames with the same PGIE results but completely different SGIE results. As can be seen in the screenshots below, the Softmax classification outputs are completely different, which leads us to believe it is not just a precision issue but something else.

These inconsistencies repeat multiple times without a specific pattern. I have also uploaded the recorded inference results, run1.txt and run2.txt.
run1.txt (951.8 KB)
run2.txt (954.5 KB)

Furthermore, after some testing we discovered that introducing a little processing delay into the probe leads to identical PGIE & SGIE results across multiple runs. If we add a dummy for-loop into body_pose_gie_src_pad_buffer_probe that writes obj_user_meta multiple times, the inference results across multiple runs match significantly better or even perfectly.
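
For reference, a minimal sketch of the same timing experiment expressed as an explicit delay instead of the dummy write loop (g_usleep() is a standard GLib call; the 1 ms value is an arbitrary assumption, not a tuned number):

/* Hypothetical variant of the workaround described above: perturb the probe's
 * timing with a short explicit sleep per object instead of re-writing
 * obj_user_meta. The 1000 us value is only an assumption. */
#include <glib.h>

static inline void
probe_timing_delay (void)
{
  g_usleep (1000); /* sleep ~1 ms to change the probe's timing */
}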

It would be great if you could look into this issue, let us know why it happens, and help us fix it. Thank you!

Hi @bolat.ashim , could you attach the 2 config files below?

config_infer_secondary_carcolor.txt
config_infer_secondary_vehicletypes.txt

Also, you can try setting the classifier-async-mode=0 field in the config_body2_infer.txt file to see if the result is the same.

Thank you for the reply! Here are the config files for secondary models, both of which are the same as in the samples.
config_infer_secondary_carcolor.txt

[property]
gpu-id=0
net-scale-factor=1
labelfile-path=./models_ds62/Secondary_CarColor/labels.txt
model-engine-file=../../../../tritonserver/models/Secondary_CarColor/1/resnet18.caffemodel_b16_gpu0_int8.engine
force-implicit-batch-dim=1
batch-size=16
model-color-format=1
process-mode=2
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=1
is-classifier=1
output-blob-names=predictions/Softmax
classifier-async-mode=0
classifier-threshold=0.51
input-object-min-width=100
input-object-min-height=100
classifier-type=carcolor
output-tensor-meta=1

config_infer_secondary_vehicletypes.txt

[property]
gpu-id=0
net-scale-factor=1
labelfile-path=./models_ds62/Secondary_VehicleTypes/labels.txt
model-engine-file=../../../../tritonserver/models/Secondary_VehicleTypes/1/resnet18.caffemodel_b16_gpu0_int8.engine
force-implicit-batch-dim=1
batch-size=16
model-color-format=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=1
is-classifier=1
process-mode=2
output-blob-names=predictions/Softmax
classifier-async-mode=0
classifier-threshold=0.51
input-object-min-width=100
input-object-min-height=100
classifier-type=vehicletype
output-tensor-meta=1

The property classifier-async-mode was already set to 0 in both secondary config files.

I also tested adding classifier-async-mode=0 to the primary detector config file, but the inconsistency in the inference outputs is the same as before. Below is the primary detector’s config file.


[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
#0=RGB, 1=BGR
model-color-format=0
#onnx-file=../../../../tritonserver/models/yolov4/1/yolov4_-1_3_416_416_dynamic.onnx.nms.onnx
model-engine-file=../../../../tritonserver/models/yolov4/1/yolov4_-1_3_416_416_dynamic.onnx_b32_gpu0.engine
labelfile-path=coco.names
#batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=80
gie-unique-id=1
network-type=0
classifier-async-mode=0
#is-classifier=0
## 0=Group Rectangles, 1=DBSCAN, 2=NMS, 3= DBSCAN+NMS Hybrid, 4 = None(No clustering)
cluster-mode=2
maintain-aspect-ratio=1
parse-bbox-func-name=NvDsInferParseCustomYoloV4
custom-lib-path=../../gst-plugins/gst-nvinferserver/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
#scaling-filter=0
#scaling-compute-hw=0

[class-attrs-all]
nms-iou-threshold=0.6
pre-cluster-threshold=0.4

I have run the demo in my env (docker: http://nvcr.io/nvidia/deepstream:6.2-triton) on a T4.
When I use the fp16 precision model engine, only a few frames show a bias, like:

But when I use the fp32 precision model engine, there is only a very small bias after the third decimal place of the coordinate values.


Can this meet your expectations?

Thank you for checking it out. Unfortunately, we are using INT8 quantization in our models, and the inconsistency in inference results is far more frequent and overwhelming, making us unable to meet the requirements of our application. Is it possible to find out what could be causing this issue and whether there are any ways to make it manageable?

To be more precise, the problem is not the inconsistency in the values after several decimal places, but rather completely different inference results from the secondary models. While a primary detector result for Car having a confidence score of 0.8888 versus 0.88789 is totally alright, a secondary model such as vehicle_type returning large_vehicle in one run and sedan in another for the exact same bounding box is not acceptable.

OK. We will focus on the issue of different car types and colors across multiple runs with the INT8 quantization models. We will provide feedback as soon as we have a conclusion.


Hi @bolat.ashim, this looks like a problem with how you fprintf to the files. Could you try adding fflush(fp) after the fprintf?

          if(fp)
          {
            fprintf(fp, "[%d] %d %s %d %f\n",
                        frame_meta->frame_num,
                        cls_meta->unique_component_id,
                        label_meta->result_label, label_meta->result_class_id, label_meta->result_prob);
+          fflush(fp);
            fclose(fp);
          }

Also, please make sure that the classifier-async-mode=0 field is set in the config file.

Hello, @yuweiw! We have tested adding fflush() and it has somewhat improved the generated outputs, but the problem of wrong secondary inference results was still there.

We tried adjusting the pipeline structure and found that placing gst-nvdsmetamux after the primary-gie bins and before the secondary-gie bins produces the same or very similar outputs across multiple runs of the same videos. There are usually small differences after several decimal places, though. The pipeline looks like the one below:

We are planning on using this structure for our application as it should meet the requirements. Thank you for looking into this issue.

Hello, @yuweiw! As a follow-up, it would be great if you could let us know about the following with regard to this issue:

  1. Does it make sense for us to use the adjusted pipeline? While the initial and adjusted versions of the pipeline should technically produce the exact same results, they happen not to match, especially at lower model precision.

  2. Is it possible to get an internal review of this phenomenon to understand what causes it and how it can be fixed? At present, constructing parallel pipelines with metamux does not seem to be completely reliable.

Thank you.

About the 1st issue, could you attach an HD picture of the pipeline? The attached picture is kind of blurry.
About the 2nd issue, it’s weird: I ran that 5 times with fflush and the results were the same every time. What is the probability of this problem occurring in your environment with fflush?

(1)
I attached the following files to showcase the differences in the results from two pipelines.

  • original_pipeline.png is the same as the one provided on the NVIDIA-AI-IOT GitHub page: 2 sources, 2 primary detectors, and 4 secondary models.

  • adjusted_pipeline.png is our modification of the original, placing metamux after the primary gies, unlike in the original pipeline. We adjusted some code in secondary_gie_bin.c to run secondary inference on the appropriate primary_gie results, as all the meta from both primary gies has been muxed by the time secondary inference is performed.

Both pipelines are supposed to produce the exact same results, because they perform the exact same inference operations.

(2)
We have used fflush for every run of the program, and with the original pipeline the inconsistency of inference results happens every single time. The amount of inconsistency is inversely correlated with the precision of the models used. Using the models provided with the NVIDIA-AI-IOT samples, I also attached the results of 2 separate runs of the pipeline and the corresponding outputs.

  • original_run1_src0.txt inference results for src0 on the first run of original pipeline
  • original_run1_src1.txt inference results for src1 on the first run of original pipeline
  • original_run2_src0.txt inference results for src0 on the second run of original pipeline
  • original_run2_src1.txt inference results for src1 on the second run of original pipeline

If you compare the outputs from original_run1_src0.txt<->original_run2_src0.txt or original_run1_src1.txt<->original_run2_src1.txt, inconsistency happens with a frequency of about once every 70-100 frames. This happens every time we test this.

I also attached the results of 2 separate runs of the adjusted pipeline.

  • adjusted_run1_src0.txt inference results for src0 on the first run of adjusted pipeline
  • adjusted_run1_src1.txt inference results for src1 on the first run of adjusted pipeline
  • adjusted_run2_src0.txt inference results for src0 on the second run of adjusted pipeline
  • adjusted_run2_src1.txt inference results for src1 on the second run of adjusted pipeline

The inconsistency that happens in the original pipeline does not occur in the adjusted version of the pipeline.

inference_outputs.zip (649.3 KB)
pipeline_images.zip (2.5 MB)

Thank you!

OK. Could you attach the patches for the adjusted pipeline? We will analyze the differences between these two scenarios. So the labels of the original pipeline are inconsistent, but the labels of the adjusted pipeline are consistent. Is that right?

Yes, that is right.
Here are the patches for the files with modified code:
deepstream_parallel_infer_app.patch (13.0 KB)
deepstream_secondary_gie_bin.patch (444 Bytes)

Below are the config files we used in our adjusted pipeline:
config_files.tar.gz (12.9 MB)

Thank you!

I ran the demo with the patches, and the adjusted pipeline’s results are consistent. But it’s weird, since metamux only merges metadata and should not change any metadata information. You can use it this way for now, and we will continue debugging this.

Thank you! Please let us know when you find a way to resolve this problem.