Is there any plan to include Video Understanding LLMs in DeepStream?

Large language models are becoming popular for text generation, and ViT-based video understanding models are also gaining traction. With the advent of open LLMs and video understanding models such as Video-LLaMA (An Instruction-tuned Audio-Visual Language Model for Video Understanding), real-time video understanding from CCTV camera footage will hopefully be feasible in the very near future.

Integrating video understanding for IP cameras to produce timestamped, contextual descriptions of video will be the future of video surveillance, and I hope the DeepStream SDK will take the lead by utilizing NVIDIA GPUs for offline video understanding. I request that the DS developers consider such an integration into the SDK, as it would greatly expand its existing video- and audio-based multimodal capabilities.

Thank you for your suggestion!

We will provide some GenAI samples in the upcoming DeepStream release. We hope they will help you!