Tutorial: Running Cosmos-1.0-Diffusion-7B on Two NVIDIA Orin AGX Devices

Hi everyone,

I’ve written a tutorial on running Cosmos-1.0-Diffusion-7B Text2World in parallel on two NVIDIA Orin AGX devices. While the focus is on the 7B Text2World model, the steps should also apply to Video2World and the 14B versions with some adjustments.
The tutorial covers:

  • Environment setup for both devices.
  • Parallelized configuration.

If you’re working with these models or similar high-performance tasks, you might find it helpful:
🔗 ParallelCosmos: Running Cosmos 1.0 on Two NVIDIA Orin AGX

I’d love your feedback or suggestions for improvement!

Best regards,
Andrei


Hi @andrei.ciobanu1984

Following the official single-generation guide (Cosmos/cosmos1/models/diffusion/README.md at main · NVIDIA/Cosmos · GitHub), we ran Cosmos-1.0-Diffusion-7B Text2World on a server with six NVIDIA RTX 6000 Ada GPUs.

However, we hit an out-of-memory error.

Any suggestions for fixing this? Many thanks.

Error message:
"
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity of 47.51 GiB of which 8.81 MiB is free. Process 8224 has 47.48 GiB memory in use. Of the allocated memory 46.99 GiB is allocated by PyTorch, and 12.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.6 documentation)
"

Hi @jmren,

You can try offloading parts of the model with the --offload_* arguments. I can run the 7B model on a single A6000 by offloading the prompt upsampler, guardrail models, and T5 text encoder (about 38.5 GB of VRAM).
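As a rough sketch of what that looks like, assuming the flag names and script path from the Cosmos repository's Text2World inference script (verify them with --help in your checkout; the prompt and output name here are just placeholders):

```shell
# Single-GPU Text2World run with the prompt upsampler, guardrail models,
# and T5 text encoder offloaded to reduce peak VRAM usage.
# Flag names assumed from the Cosmos repo's inference script; confirm with:
#   python cosmos1/models/diffusion/inference/text2world.py --help
cd Cosmos
PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/text2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Text2World \
    --prompt "A drone shot of a coastline at sunset." \
    --offload_prompt_upsampler \
    --offload_guardrail_models \
    --offload_text_encoder_model \
    --video_save_name test_offload
```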
In your particular case, I would also look at how to run the model on a multi-GPU system with torchrun: Cosmos/cosmos1/models/diffusion/nemo/inference/README.md at main · NVIDIA/Cosmos · GitHub
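The general shape of such a launch, with the exact inference script and its arguments to be taken from the NeMo inference README linked above (the script name below is a placeholder, not the real path):

```shell
# torchrun starts one worker process per GPU on the node; with six
# RTX 6000 Ada cards that means --nproc_per_node=6. Substitute the
# actual inference script and arguments from the NeMo inference README.
torchrun --nproc_per_node=6 \
    cosmos1/models/diffusion/nemo/inference/<script>.py <script-args>
```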

Hi @andrei.ciobanu1984

Yes, it works on a single A6000 using the --offload arguments.
But we would still like to know how to run it in a multi-GPU setup.
torchrun looks like a good starting point. Thanks a lot.