DeepStream 7.1 + Grounding DINO Integration Issue: Multi-Input Model Support

Issue Summary

Unable to integrate the Grounding DINO (GDINO) model with DeepStream 7.1 due to a multi-input architecture incompatibility. The model requires 6 inputs (an image tensor plus 5 text-related tensors), while DeepStream's gst-nvinfer pipeline assumes a single image input.

Environment Details

  • DeepStream Version: 7.1
  • TensorRT Version: 10.3.0
  • CUDA Version: 12.6
  • Platform: NVIDIA Jetson (ARM64) / Linux 5.15.148-tegra
  • Model: Grounding DINO Swin-T (from NGC: nvidia/tao/grounding_dino_swin_tiny_commercial_deployable)
  • Use Case: Custom object detection for “vehicle accident”, “person jaywalking”, “fire”, “smoke”

Model Architecture Analysis

Grounding DINO is a vision-language model requiring multiple inputs:

# From TAO Deploy inference.py
inputs = (
    batches,                    # [1, 3, 544, 960] - Image tensor
    input_ids,                  # [1, max_len] - Tokenized text
    attention_mask,             # [1, max_len] - Attention mask
    position_ids,               # [1, max_len] - Position IDs
    token_type_ids,             # [1, max_len] - Token types
    text_self_attention_masks   # [1, max_len, max_len] - Self-attention
)

Integration Steps Attempted

1. Standard DeepStream Configuration

Created standard TAO model configuration following documentation:

[property]
model-engine-file=/path/to/grounding_dino_swin_tiny_commercial_deployable_ds.engine
parse-bbox-func-name=NvDsInferParseCustomGroundingDINOTAO
custom-lib-path=/path/to/libnvds_infercustomparser_tao.so
output-blob-names=pred_boxes;pred_logits
infer-dims=3;544;960

2. TensorRT Engine Generation

Successfully built TensorRT engine using trtexec:

cd /TensorRT && ./bin/trtexec \
    --onnx=grounding_dino_swin_tiny_commercial_deployable.onnx \
    --memPoolSize=workspace:4096M \
    --saveEngine=grounding_dino_swin_tiny_commercial_deployable_ds.engine

Result: ✅ Engine builds successfully with 6.22 QPS throughput
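
For reference, the engine's bindings can be dumped with the TensorRT 10 Python API to confirm that all six inputs are present. This is a minimal sketch, not verified on this exact engine; the plugin-library path is a placeholder, and deserialization will fail unless the MSDA plugin (see "Custom Plugin Dependencies" below) is loaded first:

# Hypothetical sketch: list the engine's I/O tensors (TensorRT 10 Python API)
import ctypes
import tensorrt as trt

# Assumption: the MSDA plugin library built from TensorRT OSS lives here
ctypes.CDLL("/path/to/libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")

with open("grounding_dino_swin_tiny_commercial_deployable_ds.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name,
          engine.get_tensor_mode(name),   # INPUT or OUTPUT
          engine.get_tensor_dtype(name),
          engine.get_tensor_shape(name))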

3. Custom Parser Implementation

Verified that the NvDsInferParseCustomGroundingDINOTAO parser exists in the TAO post-processor library:

// From nvdsinfer_custombboxparser_tao.cpp (lines 568-653)
extern "C" bool NvDsInferParseCustomGroundingDINOTAO(
    std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
    NvDsInferNetworkInfo const &networkInfo,
    NvDsInferParseDetectionParams const &detectionParams,
    std::vector<NvDsInferObjectDetectionInfo> &objectList
)

Result: ✅ Parser exists and compiles successfully

Error Messages Encountered

Primary Error

ERROR: Unknown data type for bound layer i(attention_mask)
ERROR: initialize backend context failed on layer: 2, nvinfer error:NVDSINFER_TENSORRT_ERROR
ERROR: Failed to get fullDimLayersInfo of profile idx:0, nvinfer error:NVDSINFER_TENSORRT_ERROR
ERROR: Failed to initialize TRT backend, nvinfer error:NVDSINFER_TENSORRT_ERROR

Secondary Errors

Warning from NvDsInferContextImpl::deserializeEngineAndBackend() 
create backend context from engine from file failed

Error in NvDsInferContextImpl::generateBackendContext() 
deserialize backend context from engine from file failed, try rebuild

ERROR: failed to build network since there is no model file matched.
ERROR: failed to build network.

Root Cause Analysis

1. Multi-Input Architecture Incompatibility

  • GDINO requires 6 inputs: the engine exposes all six bindings, and DeepStream attempts to bind every input tensor
  • DeepStream expects 1 input: the standard inference pipeline assumes a single image input
  • Binding failure: attention_mask and the other text inputs cannot be bound

2. Missing Text Processing Pipeline

GDINO requires text processing for captions:

# Text processing required but not available in DeepStream
caption = ["vehicle accident . person jaywalking . fire . smoke ."]
input_ids, attention_mask, position_ids, token_type_ids, text_self_attention_masks = tokenize_captions(
    tokenizer, classes, caption, max_text_len
)
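
Since the caption is fixed for this use case, one direction is to precompute these text tensors offline and cache them. Below is a rough sketch using a Hugging Face BERT tokenizer; max_len and the self-attention-mask construction are my assumptions and would need to be checked against TAO Deploy's tokenize_captions:

# Hypothetical sketch: precompute GDINO text tensors for a fixed caption
import numpy as np
from transformers import AutoTokenizer

caption = "vehicle accident . person jaywalking . fire . smoke ."
max_len = 256  # assumption: must match the max_text_len used at ONNX export

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(caption, padding="max_length", max_length=max_len,
                return_tensors="np")

input_ids = enc["input_ids"].astype(np.int32)
attention_mask = enc["attention_mask"].astype(np.int32)
token_type_ids = enc["token_type_ids"].astype(np.int32)
position_ids = np.arange(max_len, dtype=np.int32)[None, :]

# Simplification: a dense [1, max_len, max_len] mask. GDINO actually uses a
# block-diagonal mask that blocks attention across "." phrase boundaries.
text_self_attention_masks = (attention_mask[:, :, None] *
                             attention_mask[:, None, :]).astype(bool)

np.savez("gdino_text_inputs.npz", input_ids=input_ids,
         attention_mask=attention_mask, position_ids=position_ids,
         token_type_ids=token_type_ids,
         text_self_attention_masks=text_self_attention_masks)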

3. Custom Plugin Dependencies

TensorRT engine contains custom plugins not available in standard DeepStream:

Cannot find plugin: MultiscaleDeformableAttnPlugin_TRT, version: 1, namespace:.
The [DINO tutorial docs](https://docs.nvidia.com/tao/tao-toolkit/text/ds_tao/deformable_detr_ds.html#integrating-an-deformable-detr-model) indicate this plugin can be built from TensorRT OSS, and I have followed the steps in that tutorial.
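
Once the plugin library is built, its registration can at least be sanity-checked from Python before involving DeepStream. A minimal sketch, assuming the OSS build produced the library at the path shown:

# Hypothetical check: is the MSDA plugin visible in the TRT plugin registry?
import ctypes
import tensorrt as trt

ctypes.CDLL("/path/to/libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)  # assumed path
trt.init_libnvinfer_plugins(trt.Logger(trt.Logger.WARNING), "")

creators = trt.get_plugin_registry().plugin_creator_list
print("MultiscaleDeformableAttnPlugin_TRT" in [c.name for c in creators])

For the DeepStream run itself, the same library would have to be visible to the process (e.g. preloaded) so that engine deserialization can find the plugin.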

Attempted Solutions

Solution 1: Single-Input Model Conversion

Approach: Remove the text inputs from the ONNX model and bake precomputed text embeddings into the graph

# Naive attempt: strip the text inputs from the ONNX graph (downstream nodes
# still reference the removed tensors, which is what corrupted the graph)
for input_tensor in [t for t in graph.input if t.name in inputs_to_remove]:
    graph.input.remove(input_tensor)

Result: ❌ ONNX graph corruption, missing node references
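
A cleaner variant of this approach might be to fold the precomputed text tensors into the graph as constants instead of deleting the inputs outright, so downstream node references stay valid. The following is a sketch with onnx-graphsurgeon, assuming the text side is fully static and reusing the gdino_text_inputs.npz file from the tokenizer sketch above:

# Hypothetical sketch: freeze the five text inputs as constants so the ONNX
# model becomes single-input. Assumes the text tensors are static per caption.
import numpy as np
import onnx
import onnx_graphsurgeon as gs

frozen = dict(np.load("gdino_text_inputs.npz"))
graph = gs.import_onnx(onnx.load(
    "grounding_dino_swin_tiny_commercial_deployable.onnx"))

for tensor in list(graph.tensors().values()):
    if tensor.name in frozen:
        tensor.to_constant(frozen[tensor.name])  # rewires consumers in place

graph.inputs = [t for t in graph.inputs if t.name not in frozen]
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "gdino_single_input.onnx")

Whether TensorRT can then constant-fold the text branch, and whether the resulting binding names match what the custom parser expects, is untested.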

Solution 2: Custom Parser Development

Approach: Create dedicated GDINO parser with text handling

extern "C" bool NvDsInferParseCustomGroundingDINO(/* ... */) {
    // Custom implementation for GDINO outputs
}

Result: ❌ Duplicate symbol conflict with existing TAO parser

Solution 3: TensorRT Plugin Building

Approach: Build TensorRT OSS with required plugins

# Attempted to build MultiscaleDeformableAttnPlugin_TRT

Result: ❌ Complex build process, missing plugin sources

Verification Tests

Working Control Test

Confirmed that the DeepStream pipeline works with single-input models:

# Retail model test - SUCCESS
./deepstream-app -i Highway.mp4 -c pgie_retail_config.txt
# Result: ✅ Perfect inference, detections displayed correctly

TAO Deploy Verification

Confirmed GDINO works in TAO Deploy:

# TAO Deploy inference - SUCCESS
inferencer = GDINOInferencer(engine_path, batch_size=1)
pred_logits, pred_boxes = inferencer.infer(inputs)  # 'inputs' is the 6-tuple shown above
# Result: ✅ Perfect detections on custom classes

Current Status

  • TensorRT Engine: Builds successfully, good performance
  • TAO Deploy: Works perfectly for GDINO inference
  • DeepStream Pipeline: Confirmed working with single-input models
  • Integration: Cannot load multi-input GDINO model in DeepStream

Questions for Community

  1. Multi-Input Support: Does DeepStream 7.1 support multi-input models natively? If so, what configuration is required?

  2. Text Processing: How can text embeddings be pre-computed and fed to DeepStream for vision-language models?

  3. Plugin Integration: Where can we obtain the MultiscaleDeformableAttnPlugin_TRT plugin for GDINO?

  4. Alternative Approaches: Are there recommended patterns for integrating vision-language models with DeepStream?

Request for Support

Looking for:

  1. Technical guidance on multi-input model integration
  2. Plugin availability for GDINO-specific operations
  3. Best practices for vision-language model deployment
  4. Roadmap information on enhanced model support

Files and Logs

Available for review:

  • TensorRT engine build logs
  • DeepStream application source code
  • Configuration files tested
  • Error logs with full stack traces

Any guidance on resolving this integration challenge would be greatly appreciated!


Tags: #deepstream #tensorrt #grounding-dino #multi-input #vision-language #tao-deploy #integration-issue

DeepStream does not support GenAI models. Please refer to the Grounding DINO (GDINO) — Jetson Platform Services documentation for the GenAI solution on Jetson. There is also a forum for JPS: Latest Intelligent Video Analytics/Metropolis Microservices for Jetson topics - NVIDIA Developer Forums

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
