Multi-GPU scaling using 4x RTX A5500

Hi,
I’m curious to know whether I can use four RTX A5500 cards together to scale the VRAM to 96 GB for fine-tuning an LLM.

Configuration:
Motherboard is ASUS® Pro WS WRX80E-SAGE SE WIFI II
and
CPU is AMD Ryzen Threadripper PRO 5955WX 16 Core CPU (4.5GHz, 64MB CACHE).

Thanks,
Prasanta

Hi @MarkusHoHo ,
Can you please help here?

Thanks,
Prasanta

Hi @prasanta.panja! Welcome to the NVIDIA developer forums.

I’m curious to know whether I can use four RTX A5500 together […] to fine tune LLM?

The short answer to that part is “Yes”. And, if programmed correctly, you can also benefit from all available VRAM on all GPUs.

But I suspect your thinking is that you would like to load all your LLM parameters into one contiguous chunk of VRAM and let your model run freely to fine-tune it. That is not possible.

You will have to use whatever memory-handling and workload-parallelization mechanisms your underlying DL framework offers and distribute the work over your GPUs. This article is a bit older but nevertheless still valid: Unified Memory for CUDA Beginners | NVIDIA Technical Blog. It explains the fundamentals of CUDA unified memory. And on the CUDA forums there is, for example, this question (and solution), which might help you understand the way multi-GPU programming works.
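To illustrate what “distributing the work over your GPUs” can look like in PyTorch, here is a minimal sketch of manual model parallelism: the two halves of a toy model are placed on different devices, so each GPU holds only part of the parameters. The `TwoStageModel` name and layer sizes are made up for the example, and it falls back to CPU when two GPUs are not available; real LLM fine-tuning would typically use a framework-level scheme (FSDP, DeepSpeed, etc.) built on the same idea.

```python
import torch
import torch.nn as nn

# Pick two devices if at least two GPUs are visible; otherwise fall
# back to CPU so the structure of the example stays the same.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")

class TwoStageModel(nn.Module):
    """Toy model split across two devices (hypothetical sizes)."""

    def __init__(self):
        super().__init__()
        # First half of the parameters lives on dev0, second half on
        # dev1 -- each device only holds part of the model.
        self.stage1 = nn.Linear(128, 256).to(dev0)
        self.stage2 = nn.Linear(256, 10).to(dev1)

    def forward(self, x):
        h = torch.relu(self.stage1(x.to(dev0)))
        # Activations must be moved between devices explicitly.
        return self.stage2(h.to(dev1))

model = TwoStageModel()
out = model(torch.randn(4, 128))  # output shape: (4, 10)
```

The key point is the explicit `.to(devX)` calls: parameters and activations live on specific devices, and the programmer (or the framework) is responsible for moving data between them.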

If you have more specific questions, I recommend searching in the CUDA forum category and checking out articles about multi-GPU usage for large DL models.

I hope that helps.

1 Like

Many Thanks. This is helpful.

Thanks,
Prasanta

@MarkusHoHo sorry to bother you again. The NVIDIA page (RTX A5500 Graphics Card | NVIDIA) says that by using an NVIDIA NVLink Bridge I can get up to 112 gigabytes per second (GB/s) of bandwidth and a combined 48 GB of GDDR6 memory to tackle memory-intensive workloads. So I thought at least 48 GB of combined memory was achievable. Do you think that won’t be applicable for LLM fine-tuning? I’ll mostly be using PyTorch as the deep learning framework.

Thanks,
Prasanta

Yes, it is applicable. You will be able to utilize all memory for your deep learning models, and you will get the bandwidth improvements of NVLink.

PyTorch can run with CUDA, so you will be able to handle memory management that way.
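For example, here is a minimal sketch (assuming PyTorch is installed) of how PyTorch sees the GPUs. On a box with four RTX A5500 cards this would list four separate devices of roughly 24 GiB each, not one combined 96 GiB pool:

```python
import torch

# Enumerate the CUDA devices PyTorch can see and sum their VRAM.
# Each GPU appears as its own device with its own memory; there is
# no single pooled address space.
total_bytes = 0
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")
    total_bytes += props.total_memory

print(f"Total VRAM across devices: {total_bytes / 2**30:.1f} GiB")
```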

The only thing to understand is that multiple GPUs and their VRAM do not act as if you had a single GPU with much bigger VRAM. Workloads need to be distributed and memory usage will be batched. But depending on the framework and how your code/model is designed, this might even be transparent to you or require only small adjustments to your code.

I am not a CUDA or multi-GPU DL expert, but there are plenty of tutorials “out there” that can get you started.

1 Like

Thank you very much for having patience with my queries and clarifying doubts. This conversation was really helpful.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.