Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: 6.2
• JetPack Version (valid for Jetson only):
• TensorRT Version: 184.108.40.206
• NVIDIA GPU Driver Version (valid for GPU only):
• Issue Type (questions, new requirements, bugs): question
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used, and other details for reproducing.)
• Requirement details (This is for new requirements. Include the module name, i.e. which plugin or which sample application, and the function description.)
Hello, I am relatively new to the DeepStream SDK, and I would like to ask whether there is a plugin available to run a custom image-captioning model that generates a text caption for every frame of the video.
I have read the documentation, but so far it seems that gst-nvinfer does not support this.
Hi Fiona, I am not looking to draw text over the video. Rather, I would like to generate a text caption for every video frame using the BLIP-2 model. Would that be possible? And how would I go about doing that with DeepStream?
In the BLIP-2 model, the image (frame) is first passed through a visual processor, which is basically an image encoder. The output is a tensor with dimensions of [1, 3, 768]. This tensor is then passed as input to the BLIP-2 model to generate a text caption of the image.
Since there are two steps, I am thinking of using two nvinfer plugins: one for generating the image tensor and the other for generating the image caption. However, when looking at the network types, I see that nvinfer only supports detector, classifier, segmentation, and instance segmentation. I am not sure what exactly the difference is between these network types in terms of implementation, so I am wondering whether I can even use nvinfer for the above application.
You missed the “others” type: “network-type=100”. Please refer to the sample /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-infer-tensor-meta-test
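For illustration, a minimal nvinfer config sketch using the “others” type might look like the following. This is an assumption based on the documented nvinfer properties, not a config from this thread, and the model file name is a placeholder:

```ini
# Hypothetical nvinfer config for the "others" network type.
[property]
gpu-id=0
# Placeholder name for your exported image-encoder model
onnx-file=blip2_image_encoder.onnx
network-type=100        # "others": skip the built-in detector/classifier parsing
output-tensor-meta=1    # attach the raw output tensors to the buffer as user meta
```

With output-tensor-meta=1, the application reads the raw tensors back from the batch metadata, as demonstrated in the deepstream-infer-tensor-meta-test sample.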
To generate the tensor with dimensions of [1, 3, 768], you can use gst-nvdspreprocess, which is designed for generating custom tensor data (anything other than a single image tensor). A typical sample is /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-3d-action-recognition.
The image caption is actually a string that is supposed to be a good description of the image.
Okay noted, thanks for the information. I will take a look at that and try it out. Can I just confirm that with “network-type=100” there will be no post-processing involved?
Okay noted, I will refer to the example. Sorry for my earlier confusion; it is actually a two-stage process. The first stage is the visual processor, which performs basic functions such as scaling and normalising the input image. Its output is a tensor of shape [1, 3, 224, 224]. No model inference is done at this stage.
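The scaling and normalising described above can be sketched in plain numpy. This is only an illustration of the tensor shapes involved; the normalization constants are the CLIP-style values commonly used by BLIP-2's image processor (an assumption, not confirmed in this thread), and nearest-neighbour sampling stands in for the real interpolation:

```python
import numpy as np

def visual_preprocess(frame_hwc: np.ndarray, size: int = 224) -> np.ndarray:
    """Sketch of a BLIP-2-style visual processor: resize, scale to [0, 1],
    normalize per channel, and reorder HWC -> NCHW."""
    # CLIP-style normalization constants (assumed for illustration)
    mean = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
    std = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

    h, w, _ = frame_hwc.shape
    # Nearest-neighbour resize via index sampling (placeholder for bicubic)
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    resized = frame_hwc[ys][:, xs].astype(np.float32) / 255.0

    normalized = (resized - mean) / std        # still HWC, per-channel
    chw = np.transpose(normalized, (2, 0, 1))  # HWC -> CHW
    return chw[np.newaxis, ...]                # add batch dim -> [1, 3, 224, 224]

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
tensor = visual_preprocess(frame)
print(tensor.shape)  # (1, 3, 224, 224)
```

In a DeepStream pipeline this work would not be done in Python per frame; the point is only to show what the "visual processor" stage computes.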
The second stage is the image encoder. It takes the output tensor from the visual processor and outputs a tensor of shape [1, 257, 1408]. The image encoder is a deep learning model, and model inference occurs at this stage. Can I check that the best way to structure the pipeline for this case would be Gst-nvdspreprocess (visual processor) → Gst-nvinfer (image encoder)? May I also ask what the difference is between Gst-nvdspreprocess and Gst-nvinfer?
No. Based on your description, gst-nvinfer plus customized post-processing is enough; no gst-nvdspreprocess is needed.
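The reason gst-nvdspreprocess is unnecessary here is that nvinfer already performs per-pixel scaling and mean subtraction itself, roughly y = net-scale-factor × (x − offsets). A sketch of the relevant properties (values are illustrative, not from this thread) could look like:

```ini
# Sketch only: nvinfer's built-in preprocessing replaces the "visual processor".
[property]
model-color-format=0            # 0 = RGB
infer-dims=3;224;224            # CHW input expected by the image encoder
net-scale-factor=0.00392156863  # 1/255, scales pixels toward [0, 1]
offsets=122.77;116.75;104.09    # assumed per-channel means in pixel units
network-type=100
output-tensor-meta=1
```

One caveat worth checking: net-scale-factor is a single scalar, so a per-channel division by std cannot be expressed directly; it would typically be folded into the first layer of the exported model instead.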
Please refer to the sample /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-infer-tensor-meta-test