Troubleshooting Inconsistent Inference Output with Custom Sequence Preprocessing and Large Models

• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: 7.0 (docker image: nvcr.io/nvidia/deepstream:7.0-triton-multiarch)
• NVIDIA GPU Driver Version (valid for GPU only): 535.171.04

Hi,

As I mentioned in my previous post Custom Sequence Preprocess Library for NSCHW Model, I have implemented a custom version of deepstream-3d-action-recognition/custom_sequence_preprocess. This implementation supports models with input tensors in both the NCSHW and NSCHW layouts (where N=batch_size, S=sequence_len, C=channels, H=height, W=width).
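The only difference between the two layouts is whether the channel or the sequence dimension varies slower in memory. As a minimal illustration (my own sketch, not code from the library), the flat offset of an element under each order is:

#include <cstddef>

// Flat offset of element (n, c, s, h, w) in a contiguous NCSHW tensor.
size_t idx_ncshw(size_t n, size_t c, size_t s, size_t h, size_t w,
                 size_t C, size_t S, size_t H, size_t W)
{
    return (((n * C + c) * S + s) * H + h) * W + w;
}

// Flat offset of element (n, s, c, h, w) in a contiguous NSCHW tensor:
// the sequence dimension precedes the channel dimension, so each
// sequence step stores a full contiguous CHW frame.
size_t idx_nschw(size_t n, size_t s, size_t c, size_t h, size_t w,
                 size_t C, size_t S, size_t H, size_t W)
{
    return (((n * S + s) * C + c) * H + h) * W + w;
}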

Pipeline Setup

I’ve tested this custom sequence preprocessing library in a simple pipeline built with the DeepStream Service Maker C++ APIs. Below is the pipeline configuration:

deepstream:
  nodes:
    - type: nvurisrcbin
      name: source0
      properties:
        uri: rtsp://localhost:8554/stream_0
    - type: nvstreammux
      name: mux
      properties:
        batch-size: 1
        config-file-path: config_streammux.txt
    - type: queue
      name: queue_preprocess
    - type: nvdspreprocess
      name: preprocess
      properties:
        # config-file: ../models/actionrecognitionnet/configs/config_sequence_preprocess_3d_action.txt
        config-file: ../models/xclip/configs/config_sequence_preprocess_xclip.txt
    - type: queue
      name: queue_gie
    - type: nvinfer
      name: primary_gie
      properties:
        unique-id: 1
        process-mode: 1 #Primary
        batch-size: 1
        # config-file-path: ../models/actionrecognitionnet/configs/config_nvinfer_3d_action.yml
        config-file-path: ../models/xclip/configs/config_nvinfer_xclip.yml
    - type: fakesink
      name: sink
  edges:
    source0: mux
    mux: queue_preprocess
    queue_preprocess: preprocess
    preprocess: queue_gie
    queue_gie: primary_gie
    primary_gie: sink
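For reference, the application loads this YAML roughly as follows. This is a sketch based on the DeepStream 7.0 Service Maker samples; the config filename is a placeholder, and the Pipeline constructor and start()/wait() calls should be checked against your Service Maker headers:

#include <iostream>
#include "pipeline.hpp" // DeepStream Service Maker

using namespace deepstream;

int main()
{
    try
    {
        // Build the pipeline from the YAML config above and run to EOS/error.
        // "pipeline_config.yaml" is a placeholder path.
        Pipeline pipeline("sequence-pipeline", "pipeline_config.yaml");
        pipeline.start().wait();
    }
    catch (const std::exception& e)
    {
        std::cerr << e.what() << std::endl;
        return -1;
    }
    return 0;
}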

I am using the New Gst-nvstreammux with the configuration below:

[property]
adaptive-batching=1
## Set to maximum fps
overall-min-fps-n=30
overall-min-fps-d=1
## Set to ceil(maximum fps/minimum fps)
max-same-source-frames=1
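For a single 30 fps source, these comments work out to overall-min-fps = 30/1 and max-same-source-frames = ceil(30/30) = 1, i.e., at most one frame from the source in each muxed batch.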

I’ve tested this pipeline with two different models as the primary GIE:

  1. The 3D ActionRecognitionNet model, which uses NCSHW input shapes.
  2. The XCLIP model, which I exported from PyTorch to ONNX and then converted to a TensorRT engine with NSCHW input shapes.

To capture the inference output, I attach this probe to the source pad of the primary GIE:

static GstPadProbeReturn pgie_src_pad_buffer_probe(GstPad* pad, GstPadProbeInfo* info, gpointer u_data)
{
    GstBuffer* buf = GST_PAD_PROBE_INFO_BUFFER(info);
    NvDsBatchMeta* batch_meta = gst_buffer_get_nvds_batch_meta(buf);
    if (batch_meta == NULL)
        return GST_PAD_PROBE_OK;

    NvDsMetaList* l_user_meta = NULL;
    NvDsUserMeta* user_meta = NULL;
    for (l_user_meta = batch_meta->batch_user_meta_list; l_user_meta != NULL; l_user_meta = l_user_meta->next)
    {
        user_meta = (NvDsUserMeta*)(l_user_meta->data);
        if (user_meta->base_meta.meta_type == NVDS_PREPROCESS_BATCH_META)
        {
            GstNvDsPreProcessBatchMeta* preprocess_batchmeta = (GstNvDsPreProcessBatchMeta*)(user_meta->user_meta_data);
            for (auto& roi_meta : preprocess_batchmeta->roi_vector)
            {
                NvDsMetaList* l_classifier = NULL;
                for (l_classifier = roi_meta.classifier_meta_list; l_classifier != NULL;
                    l_classifier = l_classifier->next)
                {
                    NvDsClassifierMeta* classifier_meta = (NvDsClassifierMeta*)(l_classifier->data);
                    NvDsLabelInfoList* l_label;
                    for (l_label = classifier_meta->label_info_list; l_label != NULL; l_label = l_label->next)
                    {
                        NvDsLabelInfo* label_info = (NvDsLabelInfo*)l_label->data;
                        std::cout << "Classification result - source-id: " << roi_meta.frame_meta->source_id
                            << ", frame-num: " << roi_meta.frame_meta->frame_num
                            << ", type: " << classifier_meta->classifier_type
                            << ", label: " << label_info->result_label << std::endl;
                    }
                }
            }
        }
    }

    return GST_PAD_PROBE_OK;
}
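The probe itself is attached with the standard GStreamer API, assuming access to the underlying GstPipeline (the element name matches the YAML above):

static void attach_pgie_probe(GstElement* pipeline)
{
    GstElement* pgie = gst_bin_get_by_name(GST_BIN(pipeline), "primary_gie");
    GstPad* src_pad = gst_element_get_static_pad(pgie, "src");
    gst_pad_add_probe(src_pad, GST_PAD_PROBE_TYPE_BUFFER,
        pgie_src_pad_buffer_probe, NULL, NULL);
    gst_object_unref(src_pad);
    gst_object_unref(pgie);
}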

Configuration Details

Here are the nvinfer configuration files for each model:

Nvinfer 3D ActionRecognitionNet Configuration (NCSHW):

property:
  tlt-encoded-model: ../model_files/resnet18_3d_rgb_hmdb5_32.etlt
  tlt-model-key: nvidia_tao
  model-engine-file: ../model_files/resnet18_3d_rgb_hmdb5_32.etlt_b2_gpu0_fp16.engine
  labelfile-path: ../model_files/labels.txt
  gpu-id: 0
  network-mode: 2 # 0=FP32, 1=INT8, 2=FP16 mode

  input-tensor-from-meta: 1 # requires preprocess metadata input
  network-type: 1 # 0=Detection, 1=Classifier 2=Segmentation, 3=Instance Segmentation, 100: other
  classifier-type: action3d
  tensor-meta-pool-size: 8

Nvinfer XCLIP Configuration (NSCHW):

property:
  model-engine-file: ../model_files/xclip-base-patch16-zero-shot.onnx_b8_gpu0_fp16.engine
  labelfile-path: ../model_files/labels.txt
  gpu-id: 0
  network-mode: 2 # 0=FP32, 1=INT8, 2=FP16 mode

  input-tensor-from-meta: 1 # requires preprocess metadata input
  network-type: 1 # 0=Detection, 1=Classifier 2=Segmentation, 3=Instance Segmentation, 100: other
  classifier-type: recognition
  tensor-meta-pool-size: 8

Here are the custom sequence preprocessing configuration files for each model:

Sequence preprocessing 3D ActionRecognitionNet Configuration (NCSHW):

[property]
target-unique-ids=1

# 0=process on objects 1=process on frames
process-on-frame=1

# network-input-shape: batch, channel, sequence, height, width
# 3D sequence of 32 images
network-input-shape=1;3;32;224;224

# 0=RGB, 1=BGR, 2=GRAY
network-color-format=0
# 0=NCHW, 1=NHWC, 2=CUSTOM
network-input-order=2

# 0=FP32, 1=UINT8, 2=INT8, 3=UINT32, 4=INT32, 5=FP16
tensor-data-type=0
tensor-name=input_rgb

processing-width=224
processing-height=224

maintain-aspect-ratio=1
symmetric-padding=1

# 0=NVBUF_MEM_DEFAULT 1=NVBUF_MEM_CUDA_PINNED 2=NVBUF_MEM_CUDA_DEVICE
# 3=NVBUF_MEM_CUDA_UNIFIED 4=NVBUF_MEM_SURFACE_ARRAY(Jetson)
scaling-pool-memory-type=0

# 0=NvBufSurfTransformCompute_Default 1=NvBufSurfTransformCompute_GPU
# 2=NvBufSurfTransformCompute_VIC(Jetson)
scaling-pool-compute-hw=0

# Scaling interpolation method
# 0=NvBufSurfTransformInter_Nearest 1=NvBufSurfTransformInter_Bilinear 2=NvBufSurfTransformInter_Algo1
# 3=NvBufSurfTransformInter_Algo2 4=NvBufSurfTransformInter_Algo3 5=NvBufSurfTransformInter_Algo4
# 6=NvBufSurfTransformInter_Default
scaling-filter=0

# max buffer in scaling buffer pool
scaling-buf-pool-size=8
# max buffer in tensor buffer pool
tensor-buf-pool-size=8

custom-lib-path=/usr/src/myapp/build/libs/deepstream_lib/custom_sequence_preprocess/libcustom_sequence_preprocess.so
custom-tensor-preparation-function=CustomSequenceTensorPreparation

# 3D conv custom params
[user-configs]
channel-scale-factors=0.007843137;0.007843137;0.007843137
channel-mean-offsets=127.5;127.5;127.5
stride=32
subsample=0

[group-0]
src-ids=0;1
process-on-roi=0

Sequence preprocessing XCLIP Configuration (NSCHW):

[property]
target-unique-ids=1

# 0=process on objects 1=process on frames
process-on-frame=1

# network-input-shape: batch, sequence, channel, height, width
# 3D sequence of 32 images
network-input-shape=1;32;3;224;224

# 0=RGB, 1=BGR, 2=GRAY
network-color-format=0
# 0=NCHW, 1=NHWC, 2=CUSTOM
network-input-order=2

# 0=FP32, 1=UINT8, 2=INT8, 3=UINT32, 4=INT32, 5=FP16
tensor-data-type=0
tensor-name=input_rgb

processing-width=224
processing-height=224

maintain-aspect-ratio=1
symmetric-padding=1

# 0=NVBUF_MEM_DEFAULT 1=NVBUF_MEM_CUDA_PINNED 2=NVBUF_MEM_CUDA_DEVICE
# 3=NVBUF_MEM_CUDA_UNIFIED 4=NVBUF_MEM_SURFACE_ARRAY(Jetson)
scaling-pool-memory-type=0

# 0=NvBufSurfTransformCompute_Default 1=NvBufSurfTransformCompute_GPU
# 2=NvBufSurfTransformCompute_VIC(Jetson)
scaling-pool-compute-hw=0

# Scaling interpolation method
# 0=NvBufSurfTransformInter_Nearest 1=NvBufSurfTransformInter_Bilinear 2=NvBufSurfTransformInter_Algo1
# 3=NvBufSurfTransformInter_Algo2 4=NvBufSurfTransformInter_Algo3 5=NvBufSurfTransformInter_Algo4
# 6=NvBufSurfTransformInter_Default
scaling-filter=0

# max buffer in scaling buffer pool
scaling-buf-pool-size=8
# max buffer in tensor buffer pool
tensor-buf-pool-size=8

custom-lib-path=/usr/src/myapp/build/libs/deepstream_lib/custom_sequence_preprocess/libcustom_sequence_preprocess.so
custom-tensor-preparation-function=CustomSequenceTensorPreparation

# custom params
[user-configs]
channel-scale-factors=0.007843137;0.007843137;0.007843137
channel-mean-offsets=127.5;127.5;127.5
stride=32
subsample=0

[group-0]
src-ids=0;1
process-on-roi=0

The configuration files for custom sequence preprocessing are nearly identical, with differences only in the network-input-shape, channel-scale-factors, and channel-mean-offsets to suit each model’s specifications.
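As a sanity check on the expected timing, here is a small standalone sketch (my own simplification, not the library code) of which frame numbers should land in each sequence, assuming the semantics of the deepstream-3d-action-recognition sample: a new sequence starts every stride frames, and subsample frames are skipped between picked frames (0 = every frame is used):

#include <cstdio>
#include <vector>

std::vector<int> sequence_frames(int seq_index, int stride, int seq_len, int subsample)
{
    std::vector<int> frames;
    const int step = subsample + 1; // subsample=0 -> consecutive frames
    for (int i = 0; i < seq_len; ++i)
        frames.push_back(seq_index * stride + i * step);
    return frames;
}

int main()
{
    // With stride=32, seq_len=32, subsample=0 (both configs above),
    // sequence 0 covers frames 0..31, sequence 1 covers 32..63, and so on,
    // matching the "output every 32 frames" behavior described below.
    for (int s = 0; s < 3; ++s)
    {
        auto f = sequence_frames(s, 32, 32, 0);
        std::printf("sequence %d: frames %d..%d\n", s, f.front(), f.back());
    }
    return 0;
}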

Issue Encountered

When using an RTSP stream with H.264 encoding at 640x640 resolution and 20 fps, the pipeline works as expected with both configurations. Since the preprocessing step is configured with a stride of 32, the model produces an inference output every 32 frames, confirming that the custom sequence preprocessing library works correctly in both the NCSHW and NSCHW configurations. The behavior is the same when I use an MP4 video with the same resolution as input instead of the RTSP stream.

However, when I switch to an RTSP stream at 1920x1080 and 30 fps, only the 3D ActionRecognitionNet (NCSHW) configuration behaves as expected, producing inference outputs every 32 frames. The XCLIP (NSCHW) configuration behaves inconsistently, producing inference outputs sporadically: sometimes after thousands of frames, other times after hundreds, with no apparent pattern. The issue persists when I use an MP4 video with the same resolution instead of the RTSP stream.

Interestingly, debug logs indicate that the Custom Sequence Preprocess library is correctly preparing a full batched sequence tensor every 32 frames:

A full batched sequence tensor is ready on last frame

This suggests that the issue may not lie within the preprocessing library.

I’m wondering if the problem could be related to the model size. The XCLIP model is significantly larger than the 3D ActionRecognitionNet model, and its performance metrics from trtexec are as follows:

[08/30/2024-13:08:36] [I] === Performance summary ===
[08/30/2024-13:08:36] [I] Throughput: 3.5137 qps
[08/30/2024-13:08:36] [I] Latency: min = 266.574 ms, max = 280.172 ms, mean = 273.632 ms, median = 273.005 ms, percentile(90%) = 278.941 ms, percentile(95%) = 280.172 ms, percentile(99%) = 280.172 ms
[08/30/2024-13:08:36] [I] Enqueue Time: min = 0.677918 ms, max = 2.04102 ms, mean = 1.12804 ms, median = 1.04211 ms, percentile(90%) = 1.37646 ms, percentile(95%) = 2.04102 ms, percentile(99%) = 2.04102 ms
[08/30/2024-13:08:36] [I] H2D Latency: min = 6.88428 ms, max = 8.15256 ms, mean = 7.40798 ms, median = 7.38599 ms, percentile(90%) = 7.72327 ms, percentile(95%) = 8.15256 ms, percentile(99%) = 8.15256 ms
[08/30/2024-13:08:36] [I] GPU Compute Time: min = 259.231 ms, max = 273.051 ms, mean = 266.216 ms, median = 265.581 ms, percentile(90%) = 271.731 ms, percentile(95%) = 273.051 ms, percentile(99%) = 273.051 ms
[08/30/2024-13:08:36] [I] D2H Latency: min = 0.00415039 ms, max = 0.02771 ms, mean = 0.00793021 ms, median = 0.00653076 ms, percentile(90%) = 0.00842285 ms, percentile(95%) = 0.02771 ms, percentile(99%) = 0.02771 ms
[08/30/2024-13:08:36] [I] Total Host Walltime: 3.98441 s
[08/30/2024-13:08:36] [I] Total GPU Compute Time: 3.72702 s

Questions

  1. Could the inconsistent inference output be caused by the higher input resolution combined with the larger size and slower inference speed of the XCLIP model? (trtexec reports about 3.5 qps, i.e., roughly 285 ms per inference, while at 30 fps with a stride of 32 a new sequence is only ready about every 1.07 s.)
  2. Is there a possibility that the sequence tensor prepared by the preprocessing step is not reaching the model consistently, or is there an issue with how the model processes these sequences?

Any insights or suggestions to help diagnose this issue would be greatly appreciated.

Thanks in advance for your assistance!

Maybe. There may be frame dropping within the sink when you try to run the pipeline at 30 FPS.

No.

The model's speed is no more than about 3 FPS. You may try running the pipeline at 3 FPS.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
