Deployment type: Helm (standard version from the NVIDIA VSS documentation)
Model: nvila-lite-15b-highres-lita
Audio enabled
Problem description: We are trying to extract detailed, step-by-step instructions from a video that explains an equipment assembly. We also want a summary of the safety instructions and the parts used during the video.
Video timestamps are crucial for extracting the exact frames where each instruction is being relayed; this will be done with a Python script that takes the VSS summarization output (text) and the video file itself.
The idea is to assemble these steps and frame grabs into a document. Since audio is optional and might not always match the VSS-identified steps, we can't rely on it to derive frame times.
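For reference, here is a minimal sketch of that script, assuming the summary text contains time ranges in HH:MM:SS - HH:MM:SS form (the regex, file names, and frame-selection policy are placeholders to adapt to whatever format your prompt actually produces):

```python
import re
import cv2  # pip install opencv-python

# Assumed line format: "Step 3 (00:01:23 - 00:01:45): Attach the bracket."
# Adjust the regex to the timestamp format your prompt produces.
TIME_RANGE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2}):(\d{2})")

def to_seconds(h, m, s):
    return int(h) * 3600 + int(m) * 60 + int(s)

def grab_frames(summary_text, video_path, out_prefix="step"):
    cap = cv2.VideoCapture(video_path)
    for i, match in enumerate(TIME_RANGE.finditer(summary_text), start=1):
        start = to_seconds(*match.groups()[:3])
        end = to_seconds(*match.groups()[3:])
        mid = (start + end) / 2.0  # grab a frame from the middle of the range
        cap.set(cv2.CAP_PROP_POS_MSEC, mid * 1000)
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(f"{out_prefix}_{i:02d}.png", frame)
    cap.release()

if __name__ == "__main__":
    with open("vss_summary.txt") as f:  # hypothetical file names
        grab_frames(f.read(), "assembly_video.mp4")
```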
We have experimented with different prompts and VSS customization parameters but have not been able to get correct output.
We would appreciate it if someone who has done a similar activity and has worked out which parameters to use could share some insights.
If I understand correctly, the generated summary is segmented by chunk_duration, but these segments are not precise enough, right?
What is your intended goal? Can you give an example?
The intended goal is to get the output from an instructional video as individual steps with corresponding time ranges (start and end times) for when each instruction is shown or spoken.
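For example, something like this (an invented illustration of the desired format, not real output):

```
Step 1 [00:00:12 - 00:00:40]: Unpack the base frame and lay it flat on the floor.
Step 2 [00:00:41 - 00:01:05]: Attach the left leg using the four M6 bolts.
Step 3 [00:01:06 - 00:01:30]: Tighten the bolts with the supplied hex key.
```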
The timestamp output we are getting is not related to the video content at all; at times it distributes all the steps of a 10-minute video across a span of only 3-4 minutes.
The attachment “forum_shared_version.txt” has around 8 examples of output for different prompts and parameters, all processing the same video, and all of them are incorrect, although the identified steps themselves are fine.
For the output, yes, I need segmentation like you mentioned.
However, I didn’t get what you meant by “without using the chunk function of VSS”. Could you please explain which API option to use if I have to bypass something in VSS?
In another discussion I was told not to include any temporal information in the VLM prompt (the first prompt), to set the VLM and summarization temperatures to 0, and to adjust chunk_duration so that one chunk roughly covers one instruction.
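To illustrate, this is roughly how I am applying those settings through the REST API (a sketch only; the endpoint, field names, and values here follow my reading of the VSS API docs and may differ between releases, so please correct me if any of them are wrong):

```python
import requests

VSS_URL = "http://localhost:8100"  # assumption: adjust to your VSS deployment

# Sketch of a /summarize request with the suggested settings applied.
# Verify field names against your deployment's OpenAPI schema.
payload = {
    "id": "<file-id-returned-by-the-files-upload-endpoint>",
    "model": "nvila-lite-15b-highres-lita",
    # VLM (first) prompt: describe what is shown, no temporal wording.
    "prompt": "Describe the assembly action being performed, the parts and "
              "tools involved, and any safety instructions shown or spoken.",
    "caption_summarization_prompt": "Combine the captions into numbered assembly steps.",
    "summary_aggregation_prompt": "Produce the final list of steps with their time ranges.",
    "temperature": 0,      # deterministic VLM and summarization output
    "chunk_duration": 30,  # seconds; tuned so one chunk roughly covers one instruction
}

resp = requests.post(f"{VSS_URL}/summarize", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json())
```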
I am sharing another set of prompts and outputs based on the above suggestion, although this too did not work.
I mean using the VILA model directly, because VSS will divide the video into chunks according to chunk_duration.
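For example, you could sample frames yourself for whatever time range you care about and send them straight to the model (a rough, untested sketch; the endpoint URL, model name, and payload shape are assumptions that depend on how your VILA instance is deployed, so check its API reference):

```python
import base64
import cv2
import requests

VLM_URL = "http://localhost:8000/v1/chat/completions"  # assumption: local VILA endpoint

def describe_segment(video_path, start_s, end_s, n_frames=4):
    """Sample a few frames from [start_s, end_s] and ask the VLM to describe the step."""
    cap = cv2.VideoCapture(video_path)
    imgs = []
    for k in range(n_frames):
        t = start_s + (end_s - start_s) * k / max(n_frames - 1, 1)
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = cap.read()
        if ok:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                imgs.append(base64.b64encode(buf).decode())
    cap.release()
    # Assumed payload shape: OpenAI-style chat with inline <img> tags, as used
    # by some NVIDIA VLM endpoints; adapt to your deployment's actual schema.
    content = ("What assembly instruction is shown in these frames? "
               + "".join(f'<img src="data:image/jpeg;base64,{b}" />' for b in imgs))
    payload = {
        "model": "nvidia/vila",  # assumption: adjust to your deployed model name
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,
    }
    resp = requests.post(VLM_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```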
I believe this requirement may need to rely on speech and semantic segmentation, as video analysis alone may not fully meet your expectations. We will discuss this issue internally.
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.