Hi,
I have a question regarding an apparent inconsistency between the Cosmos Cookbook best practices and the augmentation procedure described in the Cosmos Technical Report, Section 6.2 (Cosmos-Transfer2.5 for Robot Policy Learning), Subsection 6.2.2 (Data Augmentation Strategy):
“Our augmentation strategy applies global edge control across the entire image, while restricting blur control to robot pixels.”
Since blur control corresponds to Vis control, the report is explicitly masking Vis control so that it applies only to robot pixels.
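To make my reading of that sentence concrete, here is a minimal sketch of how I interpret it. The function name, the weight values, and the per-pixel weight-map representation are all my own assumptions for illustration; this is not the Cosmos-Transfer2.5 API and not necessarily how the report implements it.

```python
from typing import Dict, List

Mask = List[List[bool]]         # True where a pixel belongs to the robot
WeightMap = List[List[float]]   # per-pixel control weight


def build_control_weights(
    robot_mask: Mask,
    edge_weight: float = 1.0,   # hypothetical value, not taken from the report
    vis_weight: float = 1.0,    # hypothetical value, not taken from the report
) -> Dict[str, WeightMap]:
    """Return per-pixel weights: edge applied globally, vis (blur) only on robot pixels."""
    # Edge control: the same weight at every pixel of the image.
    edge = [[edge_weight for _ in row] for row in robot_mask]
    # Vis (blur) control: non-zero only where the mask marks robot pixels.
    vis = [
        [vis_weight if is_robot else 0.0 for is_robot in row]
        for row in robot_mask
    ]
    return {"edge": edge, "vis": vis}


# Tiny 2x2 example: the right column is "robot", the left column is background.
robot_mask = [[False, True], [False, True]]
weights = build_control_weights(robot_mask)
```

In this reading, the background pixels receive no Vis guidance at all, which is exactly the masked-Vis configuration the Cookbook warns about.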
However, the Cosmos Cookbook best practices also state:
Vis Masking (Avoid): Masking Vis control is known to cause visual hallucinations and is generally discouraged. Use Vis globally with a low weight instead.
❌ Mask Vis Control: Avoid masking Vis control as it is known to cause hallucinations.
So my question is:
Why is vis/blur control masked in the technical report despite the best-practice guidance advising against masking it?
Specifically:
Is this robotics-augmentation setup an intentional exception where masking Vis is actually appropriate?
If so, how were hallucinations mitigated in this pipeline?
Or did potential hallucinations not meaningfully affect the final robot policy training outcomes?
Any clarification on how this design choice aligns with (or differs from) the general best practices would be very helpful.
Thanks!