Tutorial: Running Cosmos-1.0-Diffusion-7B on Two NVIDIA Orin AGX Devices

Hi everyone,

I’ve written a tutorial on running Cosmos-1.0-Diffusion-7B Text2World in parallel on two NVIDIA Orin AGX devices. While the focus is on the 7B Text2World model, the steps should also apply to Video2World and the 14B versions with some adjustments.
The tutorial covers:

  • Environment setup for both devices.
  • Parallelized configuration.

If you’re working with these models or similar high-performance tasks, you might find it helpful:
🔗 ParallelCosmos: Running Cosmos 1.0 on Two NVIDIA Orin AGX

I’d love your feedback or suggestions for improvement!

Best regards,
Andrei


Hi @andrei.ciobanu1984

Following the official single-generation guide (Cosmos/cosmos1/models/diffusion/README.md at main · NVIDIA/Cosmos · GitHub), we ran Cosmos-1.0-Diffusion-7B Text2World on a server with six NVIDIA RTX 6000 Ada GPUs.

However, we hit an out-of-memory error.

Any suggestions for fixing this? Many thanks.

Error message:
"
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity of 47.51 GiB of which 8.81 MiB is free. Process 8224 has 47.48 GiB memory in use. Of the allocated memory 46.99 GiB is allocated by PyTorch, and 12.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.6 documentation)
"

Hi @jmren,

You can try offloading parts of the model with the --offload_* arguments. I can run the 7B model on a single A6000 by offloading the prompt upsampler, guardrail models, and T5 text encoder (about 38.5 GB of VRAM).
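As a rough sketch of what that looks like, assuming the flag names and script path from the Cosmos repository's Text2World inference script (verify them with --help in your checkout; the prompt and output name here are just placeholders):

```shell
# Single-GPU Text2World run with the prompt upsampler, guardrail models,
# and T5 text encoder offloaded to reduce peak VRAM usage.
# Flag names assumed from the Cosmos repo's inference script; confirm with:
#   python cosmos1/models/diffusion/inference/text2world.py --help
cd Cosmos
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/text2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Text2World \
    --prompt "A drone shot of a coastline at sunset." \
    --offload_prompt_upsampler \
    --offload_guardrail_models \
    --offload_text_encoder_model \
    --video_save_name test_offload
```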
In your particular case, I would also look at how to run the model on a multi-GPU system with torchrun: Cosmos/cosmos1/models/diffusion/nemo/inference/README.md at main · NVIDIA/Cosmos · GitHub
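The general shape of such a launch, with the exact inference script and its arguments to be taken from the NeMo inference README linked above (the script name below is a placeholder, not the real path):

```shell
# torchrun starts one worker process per GPU on the node; with six
# RTX 6000 Ada cards that means --nproc_per_node=6. Substitute the
# actual inference script and arguments from the NeMo inference README.
torchrun --nproc_per_node=6 \
    cosmos1/models/diffusion/nemo/inference/<script>.py <script-args>
```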

Hi @andrei.ciobanu1984

Yes, it works on a single A6000 using the --offload arguments.
But we would still like to know how to run it in a multi-GPU setup.
torchrun looks like a good starting point. Thanks a lot.