Question: Output Quality When Reducing Resolution from 720 to 480 with Cosmos-Transfer2.5

Hi,
I noticed that inference time improves significantly when reducing the resolution parameter from the default 720 to 480.
However, I am unsure how much this impacts output quality in practice.

Observed Behavior
When processing the example video from the COSMOS Cookbook (https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/robotics_inference.html)
at resolution 480 (all other parameters unchanged), I observed a hallucination around the 8‑second mark, where the tomato is dropped and then picked up again.
This artifact does not appear when running the same video at resolution 720.

Questions

  1. Is it expected that reducing resolution from 720 → 480 can significantly degrade output quality?

  2. Are there recommended minimum resolution settings for stable behavior? My input videos are 640x480; would the “480” resolution be an appropriate setting for them? In other words: should the model resolution match the height of the input video?

  3. Is the hallucination I observed likely just an isolated case, or is this behavior generally expected at lower resolutions?

Steps to Reproduce
Spec:

{
    "name": "kitchen_example",
    "prompt_path": "prompt.txt",
    "video_path": "kitchen_stove_input.mp4",
    "guidance": 7,
    "seed": 1,
    "resolution": "480",
    "edge": {
        "control_weight": 1.0
    }
}

Prompt:

This scene depicts a photo realistic luxury kitchen with high end professional finishes and lighting. ALL The kitchen cabinets are all highly polished bright red panels with stainless steel accents and pulls. The kitchen counters are stainless steel. The kitchen walls and backsplash are all white subway tile. The kitchen contains an expensive double door stainless steel refrigerator, a stainless steel microwave, a stainless steel oven, a stainless steel coffee machine, a stainless steel toaster, a stainless steel stove top, a stainless steel sink, and stainless steel pots. Standing in the kitchen is a humanoid robot. The robot is made of white polished panels with black accents. The camera is fixed and steady. The robot is at a kitchen stainless steel stove picking up a red cooking pot lid with his left hand and lifting it in the air. The robot is picking up two tomatoes with his right hand and putting them inside the red pot. There is steam coming out of the pot.

Input video:

Output video:

Reducing resolution can absolutely reduce output quality. It's difficult to recommend exact settings because the effect is highly context-sensitive: for some subject matter the model has a more robust understanding and produces more stable outputs than it does for a different scene.

When tackling quality issues, I usually try varying the control modalities, for example by changing the weights on edges, depth, segmentation, or blur.

Post-training can also address this by tuning the model to your specific needs.

Try experimenting some more to see if you can find a sweet spot where your results are more robust.
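
One way to organize that experimentation is to generate a small grid of spec files from the spec in the question and run each one. The sketch below is only an illustration: it varies the two fields that already appear in that spec, "resolution" and the "control_weight" under "edge". The helper names (make_variants, write_variants), the output naming scheme, and the specific weight values are assumptions made for the example, not recommended settings, and the valid range for control_weight should be checked against the Cosmos-Transfer2.5 documentation.

```python
import copy
import itertools
import json
from pathlib import Path

# Base spec copied from the question above.
BASE_SPEC = {
    "name": "kitchen_example",
    "prompt_path": "prompt.txt",
    "video_path": "kitchen_stove_input.mp4",
    "guidance": 7,
    "seed": 1,
    "resolution": "480",
    "edge": {"control_weight": 1.0},
}


def make_variants(base, resolutions, edge_weights):
    """Return one spec per (resolution, edge weight) pair.

    The base spec is deep-copied so it is never modified.
    """
    variants = []
    for resolution, weight in itertools.product(resolutions, edge_weights):
        spec = copy.deepcopy(base)
        spec["resolution"] = resolution
        spec["edge"]["control_weight"] = weight
        # Hypothetical naming scheme so each output can be traced to its settings.
        spec["name"] = f"{base['name']}_res{resolution}_edge{weight:g}"
        variants.append(spec)
    return variants


def write_variants(variants, out_dir):
    """Write each spec to <out_dir>/<name>.json and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for spec in variants:
        path = out_dir / f"{spec['name']}.json"
        path.write_text(json.dumps(spec, indent=4))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Illustrative values only; adjust to the range your setup supports.
    for spec in make_variants(BASE_SPEC, ["480", "720"], [0.5, 0.75, 1.0]):
        print(spec["name"])
```

Calling write_variants(variants, "sweep_specs") would write one JSON file per combination, which can then be passed to inference the same way as the original spec. Keeping the seed and prompt fixed across the grid makes it easier to attribute a difference, such as the dropped tomato, to the resolution or the control weight rather than to sampling noise.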