"Thank you for the reference to topic 338052. I’d like to respectfully clarify whether Grounding DINO should be categorized as a GenAI model in this context.
Technical clarification needed:
Grounding DINO, while incorporating transformer architecture and text processing, functions primarily as an object detection model - similar to traditional YOLO or RCNN models but with text-guided detection capabilities.
Key points for consideration:
Primary output: Bounding boxes and detection scores (same as traditional detectors)
Core use case: Object detection in video streams (typical DeepStream application)
Model behavior: Deterministic detection, not generative content creation
My specific questions:
Does the “GenAI” limitation apply to any model using transformers/attention mechanisms?
Or does it specifically target generative models (text/image generation, chatbots, etc.)?
Would other vision-transformer models (like DETR, Swin Transformer) also be considered “GenAI”?
Context: I’m trying to use this for standard video analytics (accident detection, fire detection) - which seems aligned with DeepStream’s core purpose, just with a more advanced detection model.
Would appreciate clarification on the technical boundaries of this limitation."
Could you attach the link of your Grounding DINO model?
Handling multiple inputs might not be an insurmountable problem. You can use our nvdspreprocess plugin to customize that, or modify our nvinfer source code directly.
But does the text prompt tensor require operations such as tokenization and embedding? Currently, there is no ready-made module in GStreamer for performing these operations, so supporting them could require a significant amount of custom code. This may be the actual technical boundary of the limitation.
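To make the tokenization step concrete, here is a minimal Python sketch of what such preprocessing would have to do: turn a prompt string into fixed-length integer tensors before inference. Everything in it is an assumption for illustration only. `TOY_VOCAB`, `encode_prompt`, and `max_len` are hypothetical names, the vocabulary is made up, and the `input_ids` / `attention_mask` output names follow common BERT-style conventions rather than anything confirmed for a particular Grounding DINO export. A real integration would need the tokenizer and vocabulary that belong to the model's own text encoder.

```python
from typing import Dict, List

# Hypothetical toy vocabulary (assumption). A real deployment would load the
# vocabulary that ships with the model's text encoder instead.
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    ".": 4, "fire": 5, "smoke": 6, "car": 7, "accident": 8,
}


def encode_prompt(prompt: str, max_len: int = 16) -> Dict[str, List[int]]:
    """Convert a text prompt into fixed-length token tensors (plain lists).

    This is a whitespace tokenizer, not a real WordPiece tokenizer; it only
    illustrates the shape of the work (lookup, special tokens, padding).
    """
    if max_len < 2:
        raise ValueError("max_len must leave room for [CLS] and [SEP]")

    # Lowercase and split the period off as its own token.
    words = prompt.lower().replace(".", " . ").split()

    # Truncate so that [CLS] and [SEP] always fit.
    words = words[: max_len - 2]

    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(w, TOY_VOCAB["[UNK]"]) for w in words]
    ids.append(TOY_VOCAB["[SEP]"])

    # Pad to the fixed length the inference engine would expect.
    n_real = len(ids)
    n_pad = max_len - n_real
    return {
        "input_ids": ids + [TOY_VOCAB["[PAD]"]] * n_pad,
        "attention_mask": [1] * n_real + [0] * n_pad,
    }


if __name__ == "__main__":
    tensors = encode_prompt("fire . smoke .", max_len=8)
    print(tensors["input_ids"])
    print(tensors["attention_mask"])
```

If the prompt is fixed for the lifetime of the pipeline, tensors like these could in principle be computed once at startup and reused for every frame, which would keep the per-frame custom code small. Whether that fits a given model export is something to verify against its actual input signature.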
I am working on integrating a fine-tuned Grounding DINO model with DeepStream. The fine-tuning was done using this notebook:
In the tao_tutorials, the following NGC model is used:
I came across this forum reply regarding DeepStream integration:
From what I have gathered, there are no available reference materials on using the nvdspreprocess plugin for Grounding DINO integration. I also understand that since DeepStream does not natively provide modules for tokenization and embedding required by the text prompt input, a significant amount of customized development is necessary.
Any further guidance or resources regarding this integration would be greatly appreciated.