Inaccurate timestamps for detailed steps generated from an instructional video of an equipment assembly process

Please provide the following information when creating a topic:

  • Hardware Platform (GPU model and numbers): Standard Nvidia launchpad with 8xH100
  • System Memory: 2TB
  • Ubuntu Version: 22.04.4
  • NVIDIA GPU Driver Version (valid for GPU only): 570.158.01
  • Issue Type (questions, new requirements, bugs): bugs
  • How to reproduce the issue? (This is for bugs. Include the command line used and other details for reproducing.)
  • Deployment details:
    • Deployment type: Helm (standard version from Nvidia VSS documentation)
    • Model: nvila-lite-15b-highres-lita
    • Audio enabled

Problem description: We are trying to get step-wise detailed instructions from a video that explains an equipment assembly process, along with a summary of the safety instructions and the parts used during the video.
Video timestamps are crucial for extracting the exact frames where each instruction is being relayed; this will be done with a Python script that takes the VSS summarization output (text) and the video file itself.
The idea is to assemble these steps and frame grabs into a document. Since audio is optional and might not always relate to the VSS-identified steps, we can't use it to get frame times.
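As a sketch of the planned frame-grab step: once a step's start time is known, a single frame can be pulled with ffmpeg. The helper below only builds the command line; the function name, output-file convention, and the choice of ffmpeg are my assumptions, not part of the VSS pipeline described here.

```python
import subprocess

def frame_grab_cmd(video_path, seconds, out_png):
    """Build an ffmpeg command that saves one frame at `seconds` into the video.

    Placing -ss before -i makes ffmpeg seek before decoding, which is fast.
    """
    return [
        "ffmpeg",
        "-ss", str(seconds),   # seek to the step's start time
        "-i", video_path,      # the assembly video file
        "-frames:v", "1",      # grab exactly one frame
        "-y",                  # overwrite the output file if it exists
        out_png,
    ]

# To actually run it (requires ffmpeg on PATH):
# subprocess.run(frame_grab_cmd("assembly.mp4", 12.5, "step01.png"), check=True)
```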

We have experimented with different prompts and VSS customization parameters but have not been able to get proper output.
We would appreciate it if someone who has done a similar activity, and has worked out which parameters to use, could share some insights.

So your point is that the generated summary is segmented by chunk_duration, but these segments are not precise enough, right?
What is your intended goal? Can you give an example?

The intended goal is to get output from an instructional video as individual steps with corresponding time ranges (start and end times) for when each instruction is being shown or spoken.

The timestamp output we are getting is not related to the video content at all. At times it distributes all the steps for a 10-minute video over a 3-4 minute span.

The attachment “forum_shared_version.txt” has around 8 examples of output for different prompts and parameters, all processing the same video, and all the timestamps are incorrect, although the identified steps themselves are fine.

This may require semantics to perform the correct segmentation.

You may consider using only the VLM (NVILA) directly, without using the chunking function of VSS.

I think you need something like this semantic segmentation by time period, right?

0:02 --> 0:10: xxx
0:10 --> 0:35: xxx 
0:35 --> 2:00: xxx
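If the summary can be coaxed into that `M:SS --> M:SS: description` shape, the frame-grab script can recover the time ranges with a small regex. A minimal sketch, assuming minute:second timestamps exactly as in the example above (the pattern and function name are mine, not part of VSS):

```python
import re

# Matches lines like "0:02 --> 0:10: attach the side panel"
SEGMENT_RE = re.compile(r"(\d+):(\d{2})\s*-->\s*(\d+):(\d{2}):\s*(.*)")

def parse_segments(summary_text):
    """Return (start_seconds, end_seconds, description) per matching line."""
    segments = []
    for line in summary_text.splitlines():
        m = SEGMENT_RE.match(line.strip())
        if m:
            m1, s1, m2, s2, desc = m.groups()
            segments.append((int(m1) * 60 + int(s1),
                             int(m2) * 60 + int(s2),
                             desc.strip()))
    return segments
```

The resulting start/end seconds can then be fed straight into whatever frame-extraction step the document-assembly script uses.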

For the output, yes, I need segmentation like you mentioned.
However, I didn’t get what you meant by “without using the chunk function of VSS”. Could you please explain which API option to use if I have to bypass something in VSS?

In another discussion I was told not to include any temporal information in the VLM prompt (the first prompt), to set both the VLM and summarization temperature to 0, and to adjust chunk_duration so that one chunk roughly covers one instruction.
I am sharing another set of prompts and outputs based on the above suggestion; this too did not work.

vss-prompts-shared-part2.txt (28.2 KB)

I mean using the VILA model directly, because VSS divides the video into chunks according to chunk_duration.

I believe this requirement may require reliance on speech and semantic segmentation, as video analysis may not fully meet expectations. We will discuss this issue internally.

You can set chunk_duration=0 to prevent the video from being divided into chunks.
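To collect the two settings suggested in this thread in one place, here is what a summarization request body might carry. Only chunk_duration=0 and temperature 0 come from this discussion; the surrounding field names are illustrative placeholders, so check the VSS API reference for the actual request schema.

```python
# Hypothetical request body; only chunk_duration and temperature are taken
# from this thread -- the other fields are illustrative placeholders.
payload = {
    "chunk_duration": 0,   # 0 = do not split the video into chunks
    "temperature": 0,      # deterministic VLM / summarization output
    "prompt": "List each assembly step with its start and end time.",
}
```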

Since there has been no update from you for a while, we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.