Fine‑tuning Cosmos Reason1 on a custom soccer “key‑moments” dataset - errors with LLaVA & HF SFT examples

Context

We are attempting to fine‑tune Cosmos Reason1 for “key‑moment” detection in soccer games, using two official examples:

  1. LLaVA post‑training example
https://github.com/nvidia-cosmos/cosmos-reason1/blob/main/examples/post_training_llava/README.md

  2. Hugging Face SFT example
    https://github.com/nvidia-cosmos/cosmos-reason1/blob/main/examples/post_training_hf/README.md

Both examples run successfully with the reference datasets provided in the repos. However, when I switch to my custom dataset (videos + annotations) for soccer key moments and adapt the preprocessing to the required format, I encounter errors at training time.

I’ve attached my labeled dataset samples and the training error logs.

Use Case

We process live video streams to detect key moments in sports events. For soccer, these include goals, yellow/red cards, shots on target, penalty shots, goal‑line events, and similar significant moments. On detection, we immediately extract short clips around those moments.

Training Objective

Fine‑tune Cosmos Reason1 with a soccer‑specific dataset and deploy via the VSS blueprint to achieve:

  • Higher precision/recall for key‑moment detection

  • More accurate start/end timestamps for each detected moment

  • Higher‑quality descriptions that include player names, jersey numbers, and team names

Working

  • Both example pipelines (LLaVA post‑training and HF SFT) succeed as‑is with the example datasets from the repository.

Not Working

  • Using my custom dataset, the runs fail during training.

  • I attempted:

    • Adapting my annotations to the LLaVA example’s expected format

    • Using the HF SFT route with a modified preprocessing script to match the SFT schema

  • In both cases, training stops with an error.

Hardware: 4× A100

Queries

  1. Dataset/Media Guidelines

    • Are there guidelines specific to preparing the video dataset for fine‑tuning the Cosmos model?

    • Are there recommended ranges for clip duration, resolution, and FPS for Cosmos Reason1 video training?

    • Any constraints for aspect ratio or max/min frames per sample that the LLaVA or SFT pipelines implicitly assume?

  2. Target Response Schema

    • For annotations/targets, should we train Cosmos to directly emit our final JSON (with fields like event_type, start_timestamp, end_timestamp, description), or is it better to train for a natural‑language description only and let VSS retrieval/post‑processing assemble the final JSON downstream?

    • If JSON is preferred, could you share a canonical example that has worked well in practice with the provided training scripts?

  3. Audio

    • Our use case benefits from commentary audio for identifying player/team names to include in the description.

    • Does the current post‑training pipeline leverage audio alongside video frames/text prompts? If so, how should audio features/streams be provided and annotated?

    • Or should we let VSS handle the audio later, once it is plugged into a Cosmos model fine‑tuned on video clips only?

  4. Reasoning Signals in Annotations

    • Given Cosmos Reason1’s reasoning capability, should our annotations include explicit chain‑of‑thought or reasoning fields (both in prompts and targets)?

    • If yes, are there formatting guidelines or examples for integrating reasoning steps without destabilizing training or causing formatting errors?

Both the LLaVA dataset format and the Hugging Face dataset format are supported.
Your labeled dataset looks fine.

The error log you provided indicates a problem with the configuration file.

[rank0]:  File "/home/shadeform/cosmos-reason1/examples/post_training_llava/.venv/lib/python3.10/site-packages/cosmos_rl/policy/model/qwen2_5_vl/__init__.py", line 882, in _process_vision_embeddings
[rank0]:    vision_embeds.shape[0] == n_tokens
[rank0]:AssertionError: vision_embeds.shape[0] must be equal to n_tokens

Please refer to this link for an introduction to fps/tokens/max_pixels.
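
For reference, here is a minimal sketch of how fps and max_pixels bound the number of vision tokens per clip, using the Qwen2.5‑VL processing utilities that the Cosmos Reason1 pipelines build on (parameter names follow qwen_vl_utils; the exact keys in your training config may differ):

from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the processor for the model being fine-tuned
processor = AutoProcessor.from_pretrained("nvidia/Cosmos-Reason1-7B")

messages = [{
    "role": "user",
    "content": [
        {
            "type": "video",
            "video": "clip_016.mp4",
            "fps": 2.0,               # frames sampled per second
            "max_pixels": 448 * 448,  # per-frame pixel budget after resizing
        },
        {"type": "text", "text": "Detect key soccer moments."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt")

# If the fps/max_pixels used in preprocessing disagree with what the training
# config expects, the number of video placeholder tokens no longer matches the
# vision embeddings, which is exactly the assertion shown above.
print(inputs["input_ids"].shape)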

As long as the annotation format can be recognized, it is acceptable. I suggest you use a consistent format for the text, like this:

[
{
  "id": "clip_016",
  "video": "clip_016.mp4",
  "conversations": [
    {
      "from": "human",
      "value": "<video>\nYou are a soccer key-moment detection assistant. Detect only: Goal, Yellow Card, Red Card, Shot on Target, or No Event.\nReturn JSON as:\n{ 'all_events': [ { 'event': '<Goal|Yellow Card|Red Card|Shot on Target|No Event>', 'start_time': <float>, 'end_time': <float>, 'description': '<=100 chars, include team(s), player name(s) and jersey number(s) if visible>' } ] }.\nIf there is no event, use event = 'No Event' and cover the entire clip duration.\n"
    },
    {
      "from": "gpt",
      "value": "{ 'all_events': [ { 'event': 'No Event', 'start_time': 0.0, 'end_time': 90.0, 'description': 'No Goal, Yellow Card, Red Card or Shot on Target; only regular play between the two teams.' } ] }"
    }
  ]
},
{
  "id": "clip_020",
  "video": "clip_020.mp4",
  "conversations": [
    {
      "from": "human",
      "value": "<video>\nYou are a soccer key-moment detection assistant. Detect only: Goal, Yellow Card, Red Card, Shot on Target, or No Event.\nReturn JSON as:\n{ 'all_events': [ { 'event': '<Goal|Yellow Card|Red Card|Shot on Target|No Event>', 'start_time': <float>, 'end_time': <float>, 'description': '<=100 chars, include team(s), player name(s) and jersey number(s) if visible>' } ] }.\n"
    },
    {
      "from": "gpt",
      "value": "{ 'all_events': [ { 'event': 'Goal', 'start_time': 50.0, 'end_time': 70.0, 'description': 'Bodø/Glimt #3 Ole Didrik Blomberg scores vs Juventus from left side, assisted by Kasper Høgh.' } ] }"
    }
  ]
}
]
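
If it helps, here is a small sketch for converting your own annotations into this conversation format (the prompt text and field names follow the example above; to_llava_record and the source event schema are illustrative):

import json
from pathlib import Path

PROMPT = (
    "<video>\nYou are a soccer key-moment detection assistant. "
    "Detect only: Goal, Yellow Card, Red Card, Shot on Target, or No Event, "
    "and return the result in JSON."
)

def to_llava_record(clip_id, video_file, events):
    # Wrap one clip's events into the conversation format shown above.
    return {
        "id": clip_id,
        "video": video_file,
        "conversations": [
            {"from": "human", "value": PROMPT},
            {"from": "gpt", "value": json.dumps({"all_events": events})},
        ],
    }

records = [
    to_llava_record(
        "clip_016",
        "clip_016.mp4",
        [{"event": "No Event", "start_time": 0.0, "end_time": 90.0,
          "description": "No key event; only regular play between the two teams."}],
    )
]
Path("annotations.json").write_text(json.dumps(records, indent=2))

Note that json.dumps emits double‑quoted JSON in the target, while the example above uses single quotes; either works, as long as you train and parse with one consistent convention.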

Cosmos can only process video; it does not consume audio. If you want to fine‑tune an audio ASR model, please refer to this Riva model cookbook.
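
Since the audio track is ignored, you can optionally strip it from the training clips so each sample carries only what the model consumes; a minimal sketch using ffmpeg (paths are illustrative):

import subprocess
from pathlib import Path

def strip_audio(src: Path, dst: Path) -> None:
    # -an drops the audio stream; -c:v copy keeps the video stream untouched
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-an", "-c:v", "copy", str(dst)],
        check=True,
    )

out_dir = Path("clips_video_only")
out_dir.mkdir(exist_ok=True)
for clip in Path("clips").glob("*.mp4"):
    strip_audio(clip, out_dir / clip.name)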

You can find similar ShareGPT‑format datasets with CoT/reasoning on Hugging Face. These datasets don’t require special tokens; they just need something like this:

{
  "id": "clip_101",
  "video": "clip_101.mp4",
  "conversations": [
    {
      "from": "human",
      "value": "<video>\nYou are a soccer key-moment detection assistant. Detect only: Goal, Yellow Card, Red Card, Shot on Target, or No Event, and return the result in JSON."
    },
    {
      "from": "gpt",
      "value": "Reasoning: I watch the entire video carefully. Around 35 seconds, Team A builds an attack on the right wing, crosses the ball into the penalty area, and player #9 from Team A heads the ball toward the goal. The ball completely crosses the goal line between the posts and under the crossbar, and the referee clearly signals a valid goal. There are no separate yellow or red card incidents, and no additional shots on target beyond this scoring attempt.\nFinal answer: { 'all_events': [ { 'event': 'Goal', 'start_time': 32.0, 'end_time': 45.0, 'description': 'Team A #9 scores a header vs Team B after a right-wing cross; no other key events.' } ] }"
    }
  ]
}
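
Downstream, you can split such a target back into the reasoning and the structured events; a minimal sketch, assuming the “Reasoning: … Final answer: …” layout above (the single‑quoted JSON parses as a Python literal rather than strict JSON):

import ast

def parse_response(text: str) -> dict:
    # Split on the "Final answer:" marker used in the target format above.
    reasoning, _, answer = text.partition("Final answer:")
    return {
        "reasoning": reasoning.removeprefix("Reasoning:").strip(),
        # The example targets use single quotes, which json.loads rejects,
        # so evaluate them as a Python literal instead.
        "events": ast.literal_eval(answer.strip())["all_events"],
    }

sample = (
    "Reasoning: Around 35 seconds, player #9 heads the ball in; the referee signals a goal.\n"
    "Final answer: { 'all_events': [ { 'event': 'Goal', 'start_time': 32.0, "
    "'end_time': 45.0, 'description': 'Team A #9 scores a header vs Team B.' } ] }"
)
print(parse_response(sample)["events"][0]["event"])  # -> Goal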

I recommend you try the latest Cosmos-Reason 2 cookbook, which is also a good reference for Cosmos-Reason 1.