Image Captioning with DeepStream

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) GPU
• DeepStream Version 6.2
• JetPack Version (valid for Jetson only)
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type (questions, new requirements, bugs): question
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used, and other details for reproducing.)
• Requirement details (This is for new requirements. Include the module name, i.e. which plugin or which sample application, and the function description.)

Hello, I am relatively new to the DeepStream SDK, and I would like to ask if there is an available plugin to run a custom image-captioning model that generates a text caption for every frame of the video.

I have read the documentation but so far it seems that gst-nvinfer does not support this function.

Thank you!

Do you mean to draw texts over video?

Hi Fiona, I am not looking to draw texts over the video. Rather, I would like to generate a text caption for every video frame using the BLIP-2 model. Would that be possible, and how would I go about doing it with DeepStream?

More details:
In the BLIP-2 model, the image (frame) is first passed through a visual processor, which is basically an image encoder. The output is a tensor with dimensions [1, 3, 768]. This tensor is then passed as input to the BLIP-2 model to generate a text caption of the image.

Since there are two steps, I am thinking of using two nvinfer plugins: one to generate the image tensor and the other to generate the image caption. However, looking at the network types, I see that nvinfer only supports detector, classifier, segmentation, and instance segmentation. I am not sure what exactly the difference is between these network types in terms of implementation, so I am wondering whether I can even use nvinfer for the above application.


What is the image caption? An image or a string?

You missed the “others” type: “network-type=100”. Please refer to the sample /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-infer-tensor-meta-test
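For reference, a minimal gst-nvinfer config sketch using the “others” network type might look like this (the model file name is a placeholder; network-type and output-tensor-meta are the settings relevant to this discussion):

```ini
[property]
# "Others" network type: nvinfer runs inference but skips its built-in
# post-processing (no detector/classifier output parsing).
network-type=100
# Attach the raw output tensors to the buffer as NvDsInferTensorMeta,
# so the application can post-process them itself (as in the
# deepstream-infer-tensor-meta-test sample).
output-tensor-meta=1
# Placeholder model file; substitute your exported image-encoder model.
onnx-file=image_encoder.onnx
batch-size=1
```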

To generate the tensor with dimensions [1, 3, 768], you can use gst-nvdspreprocess, which is designed to generate non-image tensor data or tensors built from more than a single image. A typical sample is /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-3d-action-recognition.
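As a rough sketch of what such a gst-nvdspreprocess config could contain (the shape here assumes a [1, 3, 224, 224] model input; the tensor name is a placeholder that must match your model's input layer):

```ini
[property]
enable=1
# gie-id of the downstream nvinfer instance that should consume this tensor
target-unique-ids=1
# Placeholder shape: batch;channels;height;width of the model input
network-input-shape=1;3;224;224
processing-width=224
processing-height=224
# 0 = FP32
tensor-data-type=0
# Placeholder: must match the input layer name of your model
tensor-name=input
```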

The image caption is actually a string that should be a good description of the image.

Okay, noted, thanks for the information. I will take a look at that and try it out. Can I just confirm that with “network-type=100” there will be no post-processing involved?

Okay, noted. I will refer to the example. Sorry for my earlier confusion; it is actually a two-stage process. The first stage is the visual processor, which performs basic operations such as scaling and normalising the input image. The output is a tensor of shape [1, 3, 224, 224]. No model inference is done at this stage.

The second stage is the image encoder. It takes the output tensor from the visual processor and outputs a tensor of shape [1, 257, 1408]. The image encoder is a deep learning model, and model inference occurs at this stage. Can I check that the best way to structure the pipeline for this case would be Gst-nvdspreprocess (visual processor) → Gst-nvinfer (image encoder)? May I also ask what the difference is between Gst-nvdspreprocess and Gst-nvinfer?

Thank you so much for your help.

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

No. There will not be any default internal post-processing with “network-type=100”.

Such preprocessing can be covered by gst-nvinfer's internal pre-processing. See: DeepStream SDK FAQ - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums

This can be done by gst-nvinfer too.

No. Based on your description, gst-nvinfer plus customized post-processing is enough; no gst-nvdspreprocess is needed.
Please refer to the sample /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-infer-tensor-meta-test
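As a sketch of how the visual-processor step could be folded into gst-nvinfer's internal pre-processing (all normalisation constants below are placeholders; use the mean/std values your BLIP-2 checkpoint was trained with, and note that nvinfer takes a single scalar scale factor, so exact per-channel std normalisation would need a different approach):

```ini
[property]
# Scale pixel values from [0, 255] to roughly [0, 1] (1/255)
net-scale-factor=0.0039215697906911373
# Placeholder per-channel means (in the 0-255 range), subtracted before scaling
offsets=123.675;116.28;103.53
# 0 = RGB input to the model
model-color-format=0
# Placeholder input dims: channels;height;width
infer-dims=3;224;224
# "Others" type with raw tensor output for customized post-processing
network-type=100
output-tensor-meta=1
```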

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.