Issue Summary
Unable to integrate the Grounding DINO (GDINO) model with DeepStream 7.1 due to a multi-input architecture incompatibility. The model requires six inputs (an image tensor plus five text-related tensors), while DeepStream's Gst-nvinfer pipeline assumes a single image input.
Environment Details
- DeepStream Version: 7.1
- TensorRT Version: 10.3.0
- CUDA Version: 12.6
- Platform: NVIDIA Jetson (ARM64) / Linux 5.15.148-tegra
- Model: Grounding DINO Swin-T (from NGC: nvidia/tao/grounding_dino_swin_tiny_commercial_deployable)
- Use Case: Custom object detection for “vehicle accident”, “person jaywalking”, “fire”, “smoke”
Model Architecture Analysis
Grounding DINO is a vision-language model requiring multiple inputs:
```python
# From TAO Deploy inference.py
inputs = (
    batches,                    # [1, 3, 544, 960] - Image tensor
    input_ids,                  # [1, max_len] - Tokenized text
    attention_mask,             # [1, max_len] - Attention mask
    position_ids,               # [1, max_len] - Position IDs
    token_type_ids,             # [1, max_len] - Token types
    text_self_attention_masks,  # [1, max_len, max_len] - Self-attention
)
```
Integration Steps Attempted
1. Standard DeepStream Configuration
Created standard TAO model configuration following documentation:
```
[property]
model-engine-file=/path/to/grounding_dino_swin_tiny_commercial_deployable_ds.engine
parse-bbox-func-name=NvDsInferParseCustomGroundingDINOTAO
custom-lib-path=/path/to/libnvds_infercustomparser_tao.so
output-blob-names=pred_boxes;pred_logits
infer-dims=3;544;960
```
2. TensorRT Engine Generation
Successfully built TensorRT engine using trtexec:
```shell
cd /TensorRT && ./bin/trtexec \
  --onnx=grounding_dino_swin_tiny_commercial_deployable.onnx \
  --memPoolSize=workspace:4096M \
  --saveEngine=grounding_dino_swin_tiny_commercial_deployable_ds.engine
```
Result: ✅ Engine builds successfully with 6.22 QPS throughput
3. Custom Parser Implementation
Verified that NvDsInferParseCustomGroundingDINOTAO parser exists in TAO post-processor library:
```cpp
// From nvdsinfer_custombboxparser_tao.cpp (lines 568-653)
extern "C" bool NvDsInferParseCustomGroundingDINOTAO(
    std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
    NvDsInferNetworkInfo const &networkInfo,
    NvDsInferParseDetectionParams const &detectionParams,
    std::vector<NvDsInferObjectDetectionInfo> &objectList
);
```
Result: ✅ Parser exists and compiles successfully
Error Messages Encountered
Primary Error
```
ERROR: Unknown data type for bound layer i(attention_mask)
ERROR: initialize backend context failed on layer: 2, nvinfer error:NVDSINFER_TENSORRT_ERROR
ERROR: Failed to get fullDimLayersInfo of profile idx:0, nvinfer error:NVDSINFER_TENSORRT_ERROR
ERROR: Failed to initialize TRT backend, nvinfer error:NVDSINFER_TENSORRT_ERROR
```
Secondary Errors
```
Warning from NvDsInferContextImpl::deserializeEngineAndBackend()
create backend context from engine from file failed
Error in NvDsInferContextImpl::generateBackendContext()
deserialize backend context from engine from file failed, try rebuild
ERROR: failed to build network since there is no model file matched.
ERROR: failed to build network.
```
Root Cause Analysis
1. Multi-Input Architecture Incompatibility
- GDINO requires 6 inputs: DeepStream attempts to bind all input tensors
- DeepStream expects 1 input: the standard inference pipeline assumes a single image input
- Binding failure: `attention_mask` and the other text inputs cannot be bound
2. Missing Text Processing Pipeline
GDINO requires text processing for captions:
```python
# Text processing required but not available in DeepStream
caption = ["vehicle accident . person jaywalking . fire . smoke ."]
input_ids, attention_mask, position_ids, token_type_ids, text_self_attention_masks = tokenize_captions(
    tokenizer, classes, caption, max_text_len
)
```
3. Custom Plugin Dependencies
TensorRT engine contains custom plugins not available in standard DeepStream:
```
Cannot find plugin: MultiscaleDeformableAttnPlugin_TRT, version: 1, namespace:.
```
I see in the [DINO tutorial docs](https://docs.nvidia.com/tao/tao-toolkit/text/ds_tao/deformable_detr_ds.html#integrating-an-deformable-detr-model) that this plugin is buildable, and I have followed the steps in that tutorial.
Attempted Solutions
Solution 1: Single-Input Model Conversion
Approach: Remove text inputs from ONNX model and embed text embeddings
```python
# Modified ONNX graph to remove text inputs
for input_tensor in list(graph.input):
    if input_tensor.name in inputs_to_remove:
        graph.input.remove(input_tensor)
```
Result: ❌ ONNX graph corruption, missing node references
Solution 2: Custom Parser Development
Approach: Create dedicated GDINO parser with text handling
```cpp
extern "C" bool NvDsInferParseCustomGroundingDINO(/* ... */) {
    // Custom implementation for GDINO outputs
}
```
Result: ❌ Duplicate symbol conflict with existing TAO parser
Solution 3: TensorRT Plugin Building
Approach: Build TensorRT OSS with required plugins
# Attempted to build MultiscaleDeformableAttnPlugin_TRT
Result: ❌ Complex build process, missing plugin sources
Verification Tests
Working Control Test
Confirmed DeepStream pipeline works with single-input models:
```shell
# Retail model test - SUCCESS
./deepstream-app -i Highway.mp4 -c pgie_retail_config.txt
# Result: ✅ Perfect inference, detections displayed correctly
```
TAO Deploy Verification
Confirmed GDINO works in TAO Deploy:
```python
# TAO Deploy inference - SUCCESS
inferencer = GDINOInferencer(engine_path, batch_size=1)
pred_logits, pred_boxes = inferencer.infer(inputs)
# Result: ✅ Perfect detections on custom classes
```
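For anyone replicating the parser behaviour outside DeepStream, the decode step from `pred_logits`/`pred_boxes` to detections can be sketched in NumPy. Assumptions: sigmoid scores over text tokens and normalized cxcywh boxes, as in standard DINO-family heads; the threshold value is illustrative, not what the TAO parser hardcodes.

```python
import numpy as np

def decode_gdino(pred_logits, pred_boxes, score_thresh=0.3):
    # pred_logits: [num_queries, max_text_len]
    # pred_boxes:  [num_queries, 4] as (cx, cy, w, h), normalized to [0, 1]
    scores_all = 1.0 / (1.0 + np.exp(-pred_logits))  # sigmoid
    scores = scores_all.max(axis=-1)                 # best text token per query
    labels = scores_all.argmax(axis=-1)              # which token matched
    keep = scores > score_thresh
    boxes = pred_boxes[keep]
    cx, cy, w, h = boxes.T
    # cxcywh -> xyxy, still normalized; scale by frame size for pixels
    xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
    return xyxy, scores[keep], labels[keep]
```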
Current Status
- ✅ TensorRT Engine: Builds successfully, good performance
- ✅ TAO Deploy: Works perfectly for GDINO inference
- ✅ DeepStream Pipeline: Confirmed working with single-input models
- ❌ Integration: Cannot load multi-input GDINO model in DeepStream
Questions for Community
- Multi-Input Support: Does DeepStream 7.1 support multi-input models natively? If so, what configuration is required?
- Text Processing: How can text embeddings be pre-computed and fed to DeepStream for vision-language models?
- Plugin Integration: Where can we obtain the MultiscaleDeformableAttnPlugin_TRT plugin for GDINO?
- Alternative Approaches: Are there recommended patterns for integrating vision-language models with DeepStream?
Request for Support
Looking for:
- Technical guidance on multi-input model integration
- Plugin availability for GDINO-specific operations
- Best practices for vision-language model deployment
- Roadmap information on enhanced model support
Files and Logs
Available for review:
- TensorRT engine build logs
- DeepStream application source code
- Configuration files tested
- Error logs with full stack traces
Any guidance on resolving this integration challenge would be greatly appreciated!
Tags: #deepstream #tensorrt #grounding-dino #multi-input #vision-language #tao-deploy #integration-issue