• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: 7.0 (docker image: nvcr.io/nvidia/deepstream:7.0-triton-multiarch)
• NVIDIA GPU Driver Version (valid for GPU only): 535.171.04
Hi,
As I mentioned in my previous post, Custom Sequence Preprocess Library for NSCHW Model, I have implemented a custom version of deepstream-3d-action-recognition/custom_sequence_preprocess. This implementation supports models with input shapes in both NCSHW and NSCHW format (where N=batch_size, S=sequence_len, C=channels, H=height, W=width).
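To make the two layouts concrete, here is a minimal sketch (illustrative helpers of mine, not code from the library) of the flat offset of element (n, s, c, h, w) in a contiguous tensor under each order:

#include <cstddef>

// NCSHW: for a fixed channel c, the whole sequence of S planes is contiguous.
static size_t offset_ncshw(size_t n, size_t s, size_t c, size_t h, size_t w,
                           size_t S, size_t C, size_t H, size_t W) {
    return (((n * C + c) * S + s) * H + h) * W + w;
}

// NSCHW: each of the S frames is stored as one complete CHW image.
static size_t offset_nschw(size_t n, size_t s, size_t c, size_t h, size_t w,
                           size_t S, size_t C, size_t H, size_t W) {
    return (((n * S + s) * C + c) * H + h) * W + w;
}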
Pipeline Setup
I’ve tested this custom sequence preprocessing library using a simple pipeline with DeepStream Service Maker C++ APIs. Below is the pipeline configuration:
deepstream:
  nodes:
  - type: nvurisrcbin
    name: source0
    properties:
      uri: rtsp://localhost:8554/stream_0
  - type: nvstreammux
    name: mux
    properties:
      batch-size: 1
      config-file-path: config_streammux.txt
  - type: queue
    name: queue_preprocess
  - type: nvdspreprocess
    name: preprocess
    properties:
      # config-file: ../models/actionrecognitionnet/configs/config_sequence_preprocess_3d_action.txt
      config-file: ../models/xclip/configs/config_sequence_preprocess_xclip.txt
  - type: queue
    name: queue_gie
  - type: nvinfer
    name: primary_gie
    properties:
      unique-id: 1
      process-mode: 1  # primary
      batch-size: 1
      # config-file-path: ../models/actionrecognitionnet/configs/config_nvinfer_3d_action.yml
      config-file-path: ../models/xclip/configs/config_nvinfer_xclip.yml
  - type: fakesink
    name: sink
  edges:
    source0: mux
    mux: queue_preprocess
    queue_preprocess: preprocess
    preprocess: queue_gie
    queue_gie: primary_gie
    primary_gie: sink
I am using the New Gst-nvstreammux with the configuration below:
[property]
adaptive-batching=1
## Set to maximum fps
overall-min-fps-n=30
overall-min-fps-d=1
## Set to ceil(maximum fps/minimum fps)
max-same-source-frames=1
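For a single source this works out trivially: the maximum and minimum fps coincide, so the ceiling formula in the comment gives max-same-source-frames = ceil(30/30) = 1 for the 30 fps stream.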
I’ve tested this pipeline with two different models as the primary GIE:
- The 3D ActionRecognitionNet model, which uses NCSHW input shapes.
- The XCLIP model, which I exported from PyTorch to ONNX and then converted to a TensorRT engine with NSCHW input shapes (rough conversion command below).
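For reference, the ONNX-to-TensorRT conversion was along these lines (a sketch: input_rgb is the tensor name from my preprocess config below, and the batch-8 optimization profile matches the _b8_ engine file name; treat the exact flags as an approximation of what I ran):

trtexec --onnx=xclip-base-patch16-zero-shot.onnx \
        --fp16 \
        --minShapes=input_rgb:1x32x3x224x224 \
        --optShapes=input_rgb:8x32x3x224x224 \
        --maxShapes=input_rgb:8x32x3x224x224 \
        --saveEngine=xclip-base-patch16-zero-shot.onnx_b8_gpu0_fp16.engine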
To capture the inference output, I attach this probe to the source pad of the primary GIE:
static GstPadProbeReturn pgie_src_pad_buffer_probe(GstPad* pad, GstPadProbeInfo* info, gpointer u_data)
{
    GstBuffer* buf = (GstBuffer*)info->data;
    NvDsBatchMeta* batch_meta = gst_buffer_get_nvds_batch_meta(buf);
    if (batch_meta == NULL)
        return GST_PAD_PROBE_OK;
    NvDsMetaList* l_user_meta = NULL;
    NvDsUserMeta* user_meta = NULL;
    for (l_user_meta = batch_meta->batch_user_meta_list; l_user_meta != NULL; l_user_meta = l_user_meta->next)
    {
        user_meta = (NvDsUserMeta*)(l_user_meta->data);
        if (user_meta->base_meta.meta_type == NVDS_PREPROCESS_BATCH_META)
        {
            GstNvDsPreProcessBatchMeta* preprocess_batchmeta = (GstNvDsPreProcessBatchMeta*)(user_meta->user_meta_data);
            for (auto& roi_meta : preprocess_batchmeta->roi_vector)
            {
                NvDsMetaList* l_classifier = NULL;
                for (l_classifier = roi_meta.classifier_meta_list; l_classifier != NULL;
                     l_classifier = l_classifier->next)
                {
                    NvDsClassifierMeta* classifier_meta = (NvDsClassifierMeta*)(l_classifier->data);
                    // Walk every label attached to this classifier meta.
                    NvDsLabelInfoList* l_label = NULL;
                    for (l_label = classifier_meta->label_info_list; l_label != NULL; l_label = l_label->next)
                    {
                        NvDsLabelInfo* label_info = (NvDsLabelInfo*)l_label->data;
                        std::cout << "Classification result - source-id: " << roi_meta.frame_meta->source_id
                                  << ", frame-num: " << roi_meta.frame_meta->frame_num
                                  << ", type: " << classifier_meta->classifier_type
                                  << ", label: " << label_info->result_label << std::endl;
                    }
                }
            }
        }
    }
    return GST_PAD_PROBE_OK;
}
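For completeness, the probe is attached with the standard GStreamer pad-probe API. A minimal sketch, assuming the underlying GstPipeline of the Service Maker application is reachable through a handle I call pipeline here:

// Attach the probe to the src pad of the nvinfer node named "primary_gie".
GstElement* pgie = gst_bin_get_by_name(GST_BIN(pipeline), "primary_gie");
GstPad* pgie_src_pad = gst_element_get_static_pad(pgie, "src");
gst_pad_add_probe(pgie_src_pad, GST_PAD_PROBE_TYPE_BUFFER,
                  pgie_src_pad_buffer_probe, NULL, NULL);
gst_object_unref(pgie_src_pad);
gst_object_unref(pgie);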
Configuration Details
Here are the nvinfer configuration files for each model:
Nvinfer 3D ActionRecognitionNet Configuration (NCSHW):
property:
tlt-encoded-model: ../model_files/resnet18_3d_rgb_hmdb5_32.etlt
tlt-model-key: nvidia_tao
model-engine-file: ../model_files/resnet18_3d_rgb_hmdb5_32.etlt_b2_gpu0_fp16.engine
labelfile-path: ../model_files/labels.txt
gpu-id: 0
network-mode: 2 # 0=FP32, 1=INT8, 2=FP16 mode
input-tensor-from-meta: 1 # requires preprocess metadata input
network-type: 1 # 0=Detection, 1=Classifier, 2=Segmentation, 3=Instance Segmentation, 100=other
classifier-type: action3d
tensor-meta-pool-size: 8
Nvinfer XCLIP Configuration (NSCHW):
property:
model-engine-file: ../model_files/xclip-base-patch16-zero-shot.onnx_b8_gpu0_fp16.engine
labelfile-path: ../model_files/labels.txt
gpu-id: 0
network-mode: 2 # 0=FP32, 1=INT8, 2=FP16 mode
input-tensor-from-meta: 1 # requires preprocess metadata input
network-type: 1 # 0=Detection, 1=Classifier, 2=Segmentation, 3=Instance Segmentation, 100=other
classifier-type: recognition
tensor-meta-pool-size: 8
Here are the custom sequence preprocessing configuration files for each model:
Sequence preprocessing 3D ActionRecognitionNet Configuration (NCSHW):
[property]
target-unique-ids=1
# 0=process on objects 1=process on frames
process-on-frame=1
# network-input-shape: batch, channel, sequence, height, width
# 3D sequence of 32 images
network-input-shape=1;3;32;224;224
# 0=RGB, 1=BGR, 2=GRAY
network-color-format=0
# 0=NCHW, 1=NHWC, 2=CUSTOM
network-input-order=2
# 0=FP32, 1=UINT8, 2=INT8, 3=UINT32, 4=INT32, 5=FP16
tensor-data-type=0
tensor-name=input_rgb
processing-width=224
processing-height=224
maintain-aspect-ratio=1
symmetric-padding=1
# 0=NVBUF_MEM_DEFAULT 1=NVBUF_MEM_CUDA_PINNED 2=NVBUF_MEM_CUDA_DEVICE
# 3=NVBUF_MEM_CUDA_UNIFIED 4=NVBUF_MEM_SURFACE_ARRAY(Jetson)
scaling-pool-memory-type=0
# 0=NvBufSurfTransformCompute_Default 1=NvBufSurfTransformCompute_GPU
# 2=NvBufSurfTransformCompute_VIC(Jetson)
scaling-pool-compute-hw=0
# Scaling Interpolation method
# 0=NvBufSurfTransformInter_Nearest 1=NvBufSurfTransformInter_Bilinear 2=NvBufSurfTransformInter_Algo1
# 3=NvBufSurfTransformInter_Algo2 4=NvBufSurfTransformInter_Algo3 5=NvBufSurfTransformInter_Algo4
# 6=NvBufSurfTransformInter_Default
scaling-filter=0
# max buffer in scaling buffer pool
scaling-buf-pool-size=8
# max buffer in tensor buffer pool
tensor-buf-pool-size=8
custom-lib-path=/usr/src/myapp/build/libs/deepstream_lib/custom_sequence_preprocess/libcustom_sequence_preprocess.so
custom-tensor-preparation-function=CustomSequenceTensorPreparation
# 3D conv custom params
[user-configs]
channel-scale-factors=0.007843137;0.007843137;0.007843137
channel-mean-offsets=127.5;127.5;127.5
stride=32
subsample=0
[group-0]
src-ids=0;1
process-on-roi=0
Sequence preprocessing XCLIP Configuration (NSCHW):
[property]
target-unique-ids=1
# 0=process on objects 1=process on frames
process-on-frame=1
# network-input-shape: batch, sequence, channel, height, width
# 3D sequence of 32 images
network-input-shape=1;32;3;224;224
# 0=RGB, 1=BGR, 2=GRAY
network-color-format=0
# 0=NCHW, 1=NHWC, 2=CUSTOM
network-input-order=2
# 0=FP32, 1=UINT8, 2=INT8, 3=UINT32, 4=INT32, 5=FP16
tensor-data-type=0
tensor-name=input_rgb
processing-width=224
processing-height=224
maintain-aspect-ratio=1
symmetric-padding=1
# 0=NVBUF_MEM_DEFAULT 1=NVBUF_MEM_CUDA_PINNED 2=NVBUF_MEM_CUDA_DEVICE
# 3=NVBUF_MEM_CUDA_UNIFIED 4=NVBUF_MEM_SURFACE_ARRAY(Jetson)
scaling-pool-memory-type=0
# 0=NvBufSurfTransformCompute_Default 1=NvBufSurfTransformCompute_GPU
# 2=NvBufSurfTransformCompute_VIC(Jetson)
scaling-pool-compute-hw=0
# Scaling Interpolation method
# 0=NvBufSurfTransformInter_Nearest 1=NvBufSurfTransformInter_Bilinear 2=NvBufSurfTransformInter_Algo1
# 3=NvBufSurfTransformInter_Algo2 4=NvBufSurfTransformInter_Algo3 5=NvBufSurfTransformInter_Algo4
# 6=NvBufSurfTransformInter_Default
scaling-filter=0
# max buffer in scaling buffer pool
scaling-buf-pool-size=8
# max buffer in tensor buffer pool
tensor-buf-pool-size=8
custom-lib-path=/usr/src/myapp/build/libs/deepstream_lib/custom_sequence_preprocess/libcustom_sequence_preprocess.so
custom-tensor-preparation-function=CustomSequenceTensorPreparation
# 3D conv custom params
[user-configs]
channel-scale-factors=0.007843137;0.007843137;0.007843137
channel-mean-offsets=127.5;127.5;127.5
stride=32
subsample=0
[group-0]
src-ids=0;1
process-on-roi=0
The configuration files for custom sequence preprocessing are nearly identical, differing only in network-input-shape, channel-scale-factors, and channel-mean-offsets to suit each model's specifications.
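For reference, the custom library exports the entry point named in custom-tensor-preparation-function with the signature defined by nvdspreprocess_interface.h, the same interface the deepstream-3d-action-recognition sample implements:

#include "nvdspreprocess_interface.h"

// Entry point resolved by nvdspreprocess via custom-tensor-preparation-function.
extern "C" NvDsPreProcessStatus CustomSequenceTensorPreparation(
    CustomCtx* ctx, NvDsPreProcessBatch* batch, NvDsPreProcessCustomBuf*& buf,
    CustomTensorParams& tensorParam, NvDsPreProcessAcquirer* acquirer);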
Issue Encountered
When using an RTSP stream with H.264 encoding at 640x640 resolution and 20 fps, the pipeline works as expected with both configurations. Since the preprocessing step is set with a stride of 32, the model produces inference outputs every 32 frames, confirming that the custom sequence preprocessing library works correctly in both the NCSHW and NSCHW configurations. The same behavior holds if I use an MP4 video at the same resolution as input instead of the RTSP stream.
However, when I switch to an RTSP stream at 1920x1080 and 30 fps, only the 3D ActionRecognitionNet (NCSHW) configuration behaves as expected, producing inference outputs every 32 frames. In contrast, the XCLIP (NSCHW) configuration produces inference outputs only sporadically: sometimes after thousands of frames, other times after hundreds, with no apparent pattern. The issue is the same if I use an MP4 video at the same resolution as input instead of the RTSP stream.
Interestingly, debug logs indicate that the Custom Sequence Preprocess library is correctly preparing a full batched sequence tensor every 32 frames:
A full batched sequence tensor is ready on last frame
This suggests that the issue may not lie within the preprocessing library.
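To verify this, a check I can run (a sketch of my own, reusing the same metadata walk as the probe above) is to attach a second buffer probe on the nvdspreprocess src pad and count how many buffers actually carry NVDS_PREPROCESS_BATCH_META, i.e. whether every prepared sequence tensor leaves the preprocess element:

static GstPadProbeReturn preprocess_src_pad_probe(GstPad* pad, GstPadProbeInfo* info, gpointer u_data)
{
    static guint tensor_count = 0;
    GstBuffer* buf = (GstBuffer*)info->data;
    NvDsBatchMeta* batch_meta = gst_buffer_get_nvds_batch_meta(buf);
    if (batch_meta == NULL)
        return GST_PAD_PROBE_OK;
    for (NvDsMetaList* l = batch_meta->batch_user_meta_list; l != NULL; l = l->next)
    {
        NvDsUserMeta* user_meta = (NvDsUserMeta*)l->data;
        if (user_meta->base_meta.meta_type == NVDS_PREPROCESS_BATCH_META)
            std::cout << "preprocess emitted tensor #" << ++tensor_count << std::endl;
    }
    return GST_PAD_PROBE_OK;
}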
I’m wondering if the problem could be related to the model size. The XCLIP model is significantly larger than the 3D ActionRecognitionNet model, and its performance metrics from trtexec are as follows:
[08/30/2024-13:08:36] [I] === Performance summary ===
[08/30/2024-13:08:36] [I] Throughput: 3.5137 qps
[08/30/2024-13:08:36] [I] Latency: min = 266.574 ms, max = 280.172 ms, mean = 273.632 ms, median = 273.005 ms, percentile(90%) = 278.941 ms, percentile(95%) = 280.172 ms, percentile(99%) = 280.172 ms
[08/30/2024-13:08:36] [I] Enqueue Time: min = 0.677918 ms, max = 2.04102 ms, mean = 1.12804 ms, median = 1.04211 ms, percentile(90%) = 1.37646 ms, percentile(95%) = 2.04102 ms, percentile(99%) = 2.04102 ms
[08/30/2024-13:08:36] [I] H2D Latency: min = 6.88428 ms, max = 8.15256 ms, mean = 7.40798 ms, median = 7.38599 ms, percentile(90%) = 7.72327 ms, percentile(95%) = 8.15256 ms, percentile(99%) = 8.15256 ms
[08/30/2024-13:08:36] [I] GPU Compute Time: min = 259.231 ms, max = 273.051 ms, mean = 266.216 ms, median = 265.581 ms, percentile(90%) = 271.731 ms, percentile(95%) = 273.051 ms, percentile(99%) = 273.051 ms
[08/30/2024-13:08:36] [I] D2H Latency: min = 0.00415039 ms, max = 0.02771 ms, mean = 0.00793021 ms, median = 0.00653076 ms, percentile(90%) = 0.00842285 ms, percentile(95%) = 0.02771 ms, percentile(99%) = 0.02771 ms
[08/30/2024-13:08:36] [I] Total Host Walltime: 3.98441 s
[08/30/2024-13:08:36] [I] Total GPU Compute Time: 3.72702 s
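As a rough sanity check on that hypothesis (my own back-of-the-envelope estimate, not a measurement of the full pipeline): at 30 fps with stride=32, the preprocessing library prepares a new sequence tensor about every 32/30 ≈ 1.07 s, while one XCLIP inference takes ~274 ms on average (~3.5 qps), so the engine alone should keep up with roughly 4x headroom. Decoding and scaling the 1080p stream share the same GPU, though, so contention could still change the picture.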
Questions
- Could the inconsistent inference output be due to the higher input resolution combined with the larger size and slower processing speed of the XCLIP model?
- Is there a possibility that the sequence tensor prepared by the preprocessing step is not reaching the model consistently, or is there an issue with how the model processes these sequences?
Any insights or suggestions to help diagnose this issue would be greatly appreciated.
Thanks in advance for your assistance!