Context
We are attempting to fine‑tune Cosmos Reason1 for "key‑moment" detection in soccer games, using two official examples:
- LLaVA post‑training example: https://github.com/nvidia-cosmos/cosmos-reason1/blob/main/examples/post_training_llava/README.md
- Hugging Face SFT example: https://github.com/nvidia-cosmos/cosmos-reason1/blob/main/examples/post_training_hf/README.md
Both examples run successfully with the reference datasets provided in the repos. However, when I switch to my custom dataset (videos + annotations) for soccer key moments and adapt the preprocessing to the required format, I encounter errors at training time.
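For context, our adapted preprocessing maps each annotated moment onto a LLaVA-style conversation record, roughly as sketched below. The helper and annotation field names (`event_type`, `start`, `end`, `file`) are from our own pipeline, and the record layout is our reading of the example dataset, so this may well be where things go wrong:

```python
import json
from pathlib import Path

def to_llava_record(ann: dict, video_path: str, idx: int) -> dict:
    """Map one of our soccer annotations onto the conversation layout we
    observed in the LLaVA post-training example dataset (our reading of it;
    the annotation fields are from our own schema)."""
    return {
        "id": f"soccer_{idx:06d}",
        "video": video_path,
        "conversations": [
            {"from": "human",
             "value": "<video>\nIdentify the key moment in this clip and "
                      "describe it, including player and team names."},
            {"from": "gpt",
             "value": json.dumps({
                 "event_type": ann["event_type"],
                 "start_timestamp": ann["start"],
                 "end_timestamp": ann["end"],
                 "description": ann["description"],
             })},
        ],
    }

annotations = json.load(open("annotations.json"))  # our per-clip annotations
records = [to_llava_record(a, str(Path("clips") / a["file"]), i)
           for i, a in enumerate(annotations)]
json.dump(records, open("train_llava.json", "w"), indent=2)
```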
I’ve attached:
- A sample dataset of 2 videos (video + annotation) from the custom dataset: sample_dataset.zip (99.3 MB)
- The config file: sft_config.txt.txt (1.5 KB)
- The error logs (the error is the same in both cases): cosmos_training_error.txt (83.8 KB)
Use Case
We process live video streams to detect key moments in sports events, e.g., goals, yellow/red cards, shots on target, penalty shots, goal‑line events, and similar significant moments in a soccer game. On detection, we immediately extract short clips around those moments.
Training Objective
Fine‑tune Cosmos Reason1 with a soccer‑specific dataset and deploy via the VSS blueprint to achieve:
- Higher precision/recall for key‑moment detection
- More accurate start/end timestamps for each detected moment
- Higher‑quality descriptions that include player names, jersey numbers, and team names
Working
- Both example pipelines (LLaVA post‑training and HF SFT) succeed as‑is with the example datasets from the repository.
Not Working
- Using my custom dataset, the runs fail during training.
- I attempted:
  - Adapting my annotations to the LLaVA example's expected format
  - Using the HF SFT route with a modified preprocessing script to match the SFT schema
- In both cases, training stops with the same error (see the attached logs).
Hardware: 4× A100 GPUs
Queries
- Dataset/Media Guidelines
  - Are there any guidelines specific to preparing a video dataset for fine‑tuning the Cosmos model?
  - Are there recommended ranges for clip duration, resolution, and FPS for Cosmos Reason1 video training? (Our current guesses are sketched below.)
  - Are there constraints on aspect ratio or max/min frames per sample that the LLaVA or SFT pipelines implicitly assume?
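Pending guidance, we normalize clips with a small wrapper like the one below. The target values (720p, 4 FPS) are our own guesses, not documented requirements, and the paths are illustrative:

```python
import subprocess

# Interim normalization pass: re-encode every clip to a fixed height and
# frame rate before preprocessing. TARGET_HEIGHT / TARGET_FPS are guesses,
# not documented Cosmos Reason1 requirements.
TARGET_HEIGHT = 720
TARGET_FPS = 4

def normalize_clip(src: str, dst: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            # fixed height, aspect ratio preserved, width forced to be even
            "-vf", f"scale=-2:{TARGET_HEIGHT},fps={TARGET_FPS}",
            "-c:v", "libx264", "-pix_fmt", "yuv420p",
            "-c:a", "copy",  # keep commentary audio for now (see Audio below)
            dst,
        ],
        check=True,
    )

normalize_clip("raw/goal_0001.mp4", "clips/goal_0001.mp4")
```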
- Target Response Schema
  - For annotations/targets, should we train Cosmos to directly emit our final JSON (with fields like event_type, start_timestamp, end_timestamp, and description; a concrete example follows below), or is it better to train on a natural‑language description only and let VSS retrieval/post‑processing assemble the final JSON downstream?
  - If JSON is preferred, could you share a canonical example that has worked well in practice with the provided training scripts?
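To make the question concrete, this is the kind of final JSON we mean; the schema is our own and all event details are made up:

```python
import json

# Illustrative final JSON for one detected moment (our own schema, not a
# format prescribed by the Cosmos training scripts; details are invented).
example_target = {
    "event_type": "goal",
    "start_timestamp": "00:42:17",
    "end_timestamp": "00:42:31",
    "description": "Number 9 of the home side heads in a cross from the "
                   "left wing; the keeper gets a hand to it but cannot "
                   "keep it out.",
}

# If the model should emit JSON directly, the SFT target would be the
# serialized string:
sft_target = json.dumps(example_target)
```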
- Audio
  - Our use case benefits from commentary audio for identifying player/team names to include in the description.
  - Does the current post‑training pipeline leverage audio alongside video frames/text prompts? If so, how should audio features/streams be provided and annotated? (An extraction sketch is below, in case a separate stream is expected.)
  - Or should we let VSS handle the audio part later, once it is plugged in with a Cosmos model fine‑tuned on video clips only?
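In case the answer is that audio must be supplied as a separate stream, this is how we would split it out; the 16 kHz mono WAV target is a common ASR-friendly choice on our part, not a known pipeline requirement:

```python
import subprocess

def extract_commentary(src: str, dst_wav: str) -> None:
    """Split the commentary track into 16 kHz mono WAV (an ASR-friendly
    format we chose; we don't know yet what the pipeline would expect)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vn",                       # drop the video stream
         "-ac", "1", "-ar", "16000",  # mono, 16 kHz
         dst_wav],
        check=True,
    )

extract_commentary("clips/goal_0001.mp4", "audio/goal_0001.wav")
```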
- Reasoning Signals in Annotations
  - Given Cosmos Reason1's reasoning capability, should our annotations include explicit chain‑of‑thought or reasoning fields (in both prompts and targets)?
  - If yes, are there formatting guidelines or examples for integrating reasoning steps without destabilizing training or causing formatting errors? (A sketch of what we would try is below.)
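If reasoning fields are recommended, this is roughly what we would try, assuming an R1-style <think>/<answer> target layout; that layout is our assumption, and the event details are made up:

```python
import json

# Hypothetical SFT target with an explicit reasoning trace, assuming an
# R1-style <think>/<answer> layout (our assumption, not confirmed for the
# Cosmos Reason1 training scripts; details are invented).
answer = json.dumps({
    "event_type": "yellow_card",
    "start_timestamp": "00:17:05",
    "end_timestamp": "00:17:12",
    "description": "Late tackle near the halfway line; the referee books "
                   "the defending midfielder.",
})
reasoning_target = (
    "<think>The referee shows a card after a late tackle; the card is "
    "yellow and play restarts with a free kick, so this is a booking, "
    "not a sending-off.</think>\n"
    f"<answer>{answer}</answer>"
)
```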