Fine‑tuning Cosmos Reason1 on a custom soccer “key‑moments” dataset - errors with LLaVA & HF SFT examples

Context

We are attempting to fine‑tune Cosmos Reason1 for “key‑moment” detection in soccer games, using two official examples:

  1. LLaVA post‑training example
https://github.com/nvidia-cosmos/cosmos-reason1/blob/main/examples/post_training_llava/README.md

  2. Hugging Face SFT example
    https://github.com/nvidia-cosmos/cosmos-reason1/blob/main/examples/post_training_hf/README.md

Both examples run successfully with the reference datasets provided in the repos. However, when I switch to my custom dataset (videos + annotations) for soccer key moments and adapt the preprocessing to the required format, I encounter errors at training time.

I’ve attached my labeled dataset samples and the training error logs.

Use Case

We process live video streams to detect key moments in sports events. For soccer, these include goals, yellow/red cards, shots on target, penalty shots, goal‑line events, and similar significant moments. On detection, we immediately extract short clips around those moments.

Training Objective

Fine‑tune Cosmos Reason1 with a soccer‑specific dataset and deploy via the VSS blueprint to achieve:

  • Higher precision/recall for key‑moment detection

  • More accurate start/end timestamps for each detected moment

  • Higher‑quality descriptions that include player names, jersey numbers, and team names

Working

  • Both example pipelines (LLaVA post‑training and HF SFT) succeed as‑is with the example datasets from the repository.

Not Working

  • Using my custom dataset, the runs fail during training.

  • I attempted:

    • Adapting my annotations to the LLaVA example’s expected format

    • Using the HF SFT route with a modified preprocessing script to match the SFT schema

  • In both cases, training stops with an error.

Hardware: 4× A100

Queries

  1. Dataset/Media Guidelines

    • Are there guidelines specific to preparing the video dataset for fine‑tuning the Cosmos model?

    • Are there recommended ranges for clip duration, resolution, and FPS for Cosmos Reason1 video training?

    • Any constraints for aspect ratio or max/min frames per sample that the LLaVA or SFT pipelines implicitly assume?

  2. Target Response Schema

    • For annotations/targets, should we train Cosmos to directly emit our final JSON (with fields like event_type, start_timestamp, end_timestamp, description), or is it better to train for a natural‑language description only and let VSS retrieval/post‑processing assemble the final JSON downstream?

    • If JSON is preferred, could you share a canonical example that has worked well in practice with the provided training scripts?

  3. Audio

    • Our use case benefits from commentary audio for identifying player/team names to include in the description.

    • Does the current post‑training pipeline leverage audio alongside video frames/text prompts? If so, how should audio features/streams be provided and annotated?

    • Or should we let VSS handle the audio later, once it is plugged into a Cosmos model fine‑tuned on video clips only?

  4. Reasoning Signals in Annotations

    • Given Cosmos Reason1’s reasoning capability, should our annotations include explicit chain‑of‑thought or reasoning fields (both in prompts and targets)?

    • If yes, are there formatting guidelines or examples for integrating reasoning steps without destabilizing training or causing formatting errors?

Both the LLaVA dataset format and the Hugging Face dataset format are supported.
Your labeled dataset looks fine.

The error log you provided indicates a problem with the configuration file.

[rank0]:  File "/home/shadeform/cosmos-reason1/examples/post_training_llava/.venv/lib/python3.10/site-packages/cosmos_rl/policy/model/qwen2_5_vl/__init__.py", line 882, in _process_vision_embeddings
[rank0]:    vision_embeds.shape[0] == n_tokens
[rank0]:AssertionError: vision_embeds.shape[0] must be equal to n_tokens

Please refer to this link for an introduction to fps/tokens/max_pixels.
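
For reference, here is a minimal sketch of how fps and max_pixels bound the number of vision tokens per clip, using the Qwen2.5‑VL processing utilities that the Cosmos Reason1 pipelines build on (parameter names follow qwen_vl_utils; the exact keys in your training config may differ):

from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the processor for the model being fine-tuned
processor = AutoProcessor.from_pretrained("nvidia/Cosmos-Reason1-7B")

messages = [{
    "role": "user",
    "content": [
        {
            "type": "video",
            "video": "clip_016.mp4",
            "fps": 2.0,               # frames sampled per second
            "max_pixels": 448 * 448,  # per-frame pixel budget after resizing
        },
        {"type": "text", "text": "Detect key soccer moments."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt")

# If the fps/max_pixels used in preprocessing disagree with what the training
# config expects, the number of video placeholder tokens no longer matches the
# vision embeddings, which is exactly the assertion shown above.
print(inputs["input_ids"].shape)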

As long as the annotation format can be recognized, it is acceptable. I suggest you use a consistent format for the text, like this:

[
{
  "id": "clip_016",
  "video": "clip_016.mp4",
  "conversations": [
    {
      "from": "human",
      "value": "<video>\nYou are a soccer key-moment detection assistant. Detect only: Goal, Yellow Card, Red Card, Shot on Target, or No Event.\nReturn JSON as:\n{ 'all_events': [ { 'event': '<Goal|Yellow Card|Red Card|Shot on Target|No Event>', 'start_time': <float>, 'end_time': <float>, 'description': '<=100 chars, include team(s), player name(s) and jersey number(s) if visible>' } ] }.\nIf there is no event, use event = 'No Event' and cover the entire clip duration.\n"
    },
    {
      "from": "gpt",
      "value": "{ 'all_events': [ { 'event': 'No Event', 'start_time': 0.0, 'end_time': 90.0, 'description': 'No Goal, Yellow Card, Red Card or Shot on Target; only regular play between the two teams.' } ] }"
    }
  ]
},
{
  "id": "clip_020",
  "video": "clip_020.mp4",
  "conversations": [
    {
      "from": "human",
      "value": "<video>\nYou are a soccer key-moment detection assistant. Detect only: Goal, Yellow Card, Red Card, Shot on Target, or No Event.\nReturn JSON as:\n{ 'all_events': [ { 'event': '<Goal|Yellow Card|Red Card|Shot on Target|No Event>', 'start_time': <float>, 'end_time': <float>, 'description': '<=100 chars, include team(s), player name(s) and jersey number(s) if visible>' } ] }.\n"
    },
    {
      "from": "gpt",
      "value": "{ 'all_events': [ { 'event': 'Goal', 'start_time': 50.0, 'end_time': 70.0, 'description': 'Bodø/Glimt #3 Ole Didrik Blomberg scores vs Juventus from left side, assisted by Kasper Høgh.' } ] }"
    }
  ]
}
]
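
If it helps, here is a small sketch for converting your own annotations into this conversation format (the prompt text and field names follow the example above; to_llava_record and the source event schema are illustrative):

import json
from pathlib import Path

PROMPT = (
    "<video>\nYou are a soccer key-moment detection assistant. "
    "Detect only: Goal, Yellow Card, Red Card, Shot on Target, or No Event, "
    "and return the result in JSON."
)

def to_llava_record(clip_id, video_file, events):
    # Wrap one clip's events into the conversation format shown above.
    return {
        "id": clip_id,
        "video": video_file,
        "conversations": [
            {"from": "human", "value": PROMPT},
            {"from": "gpt", "value": json.dumps({"all_events": events})},
        ],
    }

records = [
    to_llava_record(
        "clip_016",
        "clip_016.mp4",
        [{"event": "No Event", "start_time": 0.0, "end_time": 90.0,
          "description": "No key event; only regular play between the two teams."}],
    )
]
Path("annotations.json").write_text(json.dumps(records, indent=2))

Note that json.dumps emits double‑quoted JSON in the target, while the example above uses single quotes; either works, as long as you train and parse with one consistent convention.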

Cosmos can only process video; it does not consume audio. If you want to fine‑tune an audio ASR model, please refer to this Riva model cookbook.
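
Since the audio track is ignored, you can optionally strip it from the training clips so each sample carries only what the model consumes; a minimal sketch using ffmpeg (paths are illustrative):

import subprocess
from pathlib import Path

def strip_audio(src: Path, dst: Path) -> None:
    # -an drops the audio stream; -c:v copy keeps the video stream untouched
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-an", "-c:v", "copy", str(dst)],
        check=True,
    )

out_dir = Path("clips_video_only")
out_dir.mkdir(exist_ok=True)
for clip in Path("clips").glob("*.mp4"):
    strip_audio(clip, out_dir / clip.name)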

You can find similar ShareGPT‑format datasets with CoT/reasoning on Hugging Face. These datasets don’t require special tokens; they just need something like this:

{
  "id": "clip_101",
  "video": "clip_101.mp4",
  "conversations": [
    {
      "from": "human",
      "value": "<video>\nYou are a soccer key-moment detection assistant. Detect only: Goal, Yellow Card, Red Card, Shot on Target, or No Event, and return the result in JSON."
    },
    {
      "from": "gpt",
      "value": "Reasoning: I watch the entire video carefully. Around 35 seconds, Team A builds an attack on the right wing, crosses the ball into the penalty area, and player #9 from Team A heads the ball toward the goal. The ball completely crosses the goal line between the posts and under the crossbar, and the referee clearly signals a valid goal. There are no separate yellow or red card incidents, and no additional shots on target beyond this scoring attempt.\nFinal answer: { 'all_events': [ { 'event': 'Goal', 'start_time': 32.0, 'end_time': 45.0, 'description': 'Team A #9 scores a header vs Team B after a right-wing cross; no other key events.' } ] }"
    }
  ]
}
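
Downstream, you can split such a target back into the reasoning and the structured events; a minimal sketch, assuming the “Reasoning: … Final answer: …” layout above (the single‑quoted JSON parses as a Python literal rather than strict JSON):

import ast

def parse_response(text: str) -> dict:
    # Split on the "Final answer:" marker used in the target format above.
    reasoning, _, answer = text.partition("Final answer:")
    return {
        "reasoning": reasoning.removeprefix("Reasoning:").strip(),
        # The example targets use single quotes, which json.loads rejects,
        # so evaluate them as a Python literal instead.
        "events": ast.literal_eval(answer.strip())["all_events"],
    }

sample = (
    "Reasoning: Around 35 seconds, player #9 heads the ball in; the referee signals a goal.\n"
    "Final answer: { 'all_events': [ { 'event': 'Goal', 'start_time': 32.0, "
    "'end_time': 45.0, 'description': 'Team A #9 scores a header vs Team B.' } ] }"
)
print(parse_response(sample)["events"][0]["event"])  # -> Goal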

I recommend you try the latest Cosmos-Reason 2 cookbook, which is also a good reference for Cosmos-Reason 1.