Deployment type: Helm (standard version from the NVIDIA VSS documentation)
Model: nvila-lite-15b-highres-lita
Audio enabled
Problem description: We are trying to extract detailed, step-by-step instructions from a video that explains an equipment assembly. We also want a summary of the safety instructions and the parts used during the video.
Video timestamps are crucial for extracting the exact frames where each instruction is being relayed; this will be done with a Python script that takes the VSS summarization output (text) and the video file itself.
The idea is to assemble these steps and frame grabs into a document. Since audio is optional and might not always match the VSS-identified steps, we can't rely on it to derive frame times.
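For reference, here is a minimal sketch of that script, assuming the summary text contains time ranges in HH:MM:SS - HH:MM:SS form (the regex, file names, and frame-selection policy are placeholders to adapt to whatever format your prompt actually produces):

```python
import re
import cv2  # pip install opencv-python

# Assumed line format: "Step 3 (00:01:23 - 00:01:45): Attach the bracket."
# Adjust the regex to the timestamp format your prompt produces.
TIME_RANGE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2}):(\d{2})")

def to_seconds(h, m, s):
    return int(h) * 3600 + int(m) * 60 + int(s)

def grab_frames(summary_text, video_path, out_prefix="step"):
    cap = cv2.VideoCapture(video_path)
    for i, match in enumerate(TIME_RANGE.finditer(summary_text), start=1):
        start = to_seconds(*match.groups()[:3])
        end = to_seconds(*match.groups()[3:])
        mid = (start + end) / 2.0  # grab a frame from the middle of the range
        cap.set(cv2.CAP_PROP_POS_MSEC, mid * 1000)
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(f"{out_prefix}_{i:02d}.png", frame)
    cap.release()

if __name__ == "__main__":
    with open("vss_summary.txt") as f:  # hypothetical file names
        grab_frames(f.read(), "assembly_video.mp4")
```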
We have experimented with different prompts and VSS customization parameters but have not been able to get correct output.
We would appreciate it if someone who has done a similar activity and has worked out which parameters to use could share some insights.
If I understand correctly, the generated summary is segmented by chunk_duration, but these segments are not precise enough, right?
What is your intended goal? Can you give an example?
The intended goal is to get the output from an instructional video as individual steps with corresponding time ranges (start and end times) for when each instruction is shown or spoken.
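For example, something like this (an invented illustration of the desired format, not real output):

```
Step 1 [00:00:12 - 00:00:40]: Unpack the base frame and lay it flat on the floor.
Step 2 [00:00:41 - 00:01:05]: Attach the left leg using the four M6 bolts.
Step 3 [00:01:06 - 00:01:30]: Tighten the bolts with the supplied hex key.
```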
The timestamp output we are getting is not related to the video content at all; at times it distributes all the steps of a 10-minute video across a span of only 3-4 minutes.
The attachment “forum_shared_version.txt” has around 8 examples of output for different prompts and parameters, all processing the same video, and all of them are incorrect, although the identified steps themselves are fine.
For the output, yes, I need segmentation like you mentioned.
However, I didn’t get what you meant by “without using the chunk function of VSS”. Could you please explain which API option to use if I have to bypass something in VSS?
In another discussion I was told not to include any temporal information in the VLM prompt (the first prompt), to set the VLM and summarization temperatures to 0, and to adjust chunk_duration so that one chunk roughly covers one instruction.
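To illustrate, this is roughly how I am applying those settings through the REST API (a sketch only; the endpoint, field names, and values here follow my reading of the VSS API docs and may differ between releases, so please correct me if any of them are wrong):

```python
import requests

VSS_URL = "http://localhost:8100"  # assumption: adjust to your VSS deployment

# Sketch of a /summarize request with the suggested settings applied.
# Verify field names against your deployment's OpenAPI schema.
payload = {
    "id": "<file-id-returned-by-the-files-upload-endpoint>",
    "model": "nvila-lite-15b-highres-lita",
    # VLM (first) prompt: describe what is shown, no temporal wording.
    "prompt": "Describe the assembly action being performed, the parts and "
              "tools involved, and any safety instructions shown or spoken.",
    "caption_summarization_prompt": "Combine the captions into numbered assembly steps.",
    "summary_aggregation_prompt": "Produce the final list of steps with their time ranges.",
    "temperature": 0,      # deterministic VLM and summarization output
    "chunk_duration": 30,  # seconds; tuned so one chunk roughly covers one instruction
}

resp = requests.post(f"{VSS_URL}/summarize", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json())
```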
I am sharing another set of prompts and outputs based on the above suggestion, although this too did not work.
I mean using the VILA model directly, because VSS will divide the video into chunks according to chunk_duration.
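For example, you could sample frames yourself for whatever time range you care about and send them straight to the model (a rough, untested sketch; the endpoint URL, model name, and payload shape are assumptions that depend on how your VILA instance is deployed, so check its API reference):

```python
import base64
import cv2
import requests

VLM_URL = "http://localhost:8000/v1/chat/completions"  # assumption: local VILA endpoint

def describe_segment(video_path, start_s, end_s, n_frames=4):
    """Sample a few frames from [start_s, end_s] and ask the VLM to describe the step."""
    cap = cv2.VideoCapture(video_path)
    imgs = []
    for k in range(n_frames):
        t = start_s + (end_s - start_s) * k / max(n_frames - 1, 1)
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = cap.read()
        if ok:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                imgs.append(base64.b64encode(buf).decode())
    cap.release()
    # Assumed payload shape: OpenAI-style chat with inline <img> tags, as used
    # by some NVIDIA VLM endpoints; adapt to your deployment's actual schema.
    content = ("What assembly instruction is shown in these frames? "
               + "".join(f'<img src="data:image/jpeg;base64,{b}" />' for b in imgs))
    payload = {
        "model": "nvidia/vila",  # assumption: adjust to your deployed model name
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,
    }
    resp = requests.post(VLM_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```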
I believe this requirement may need to rely on speech and semantic segmentation, as video analysis alone may not fully meet your expectations. We will discuss this issue internally.
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.