Dual RTX 4090 with distributed training

Hi,

We are planning to buy a workstation for training computer vision models. Since we are a small company, we are considering a workstation with 2x RTX 4090 in it. I am a deep learning engineer, but so far I have only worked with a single GPU, so I have some questions about distributed training.

So far I have read that NVIDIA removed NVLink for the 40 series, which means, if I understand correctly, that data has to be transferred over PCIe. I have seen some tutorials on the internet about distributed training (naive model parallelism => dividing the model into two parts, each part on a GPU).
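
Roughly, what I have seen in those tutorials looks like this (a minimal sketch; the layer sizes are placeholders I made up):

```python
import torch
import torch.nn as nn

# Naive model parallelism: first half of the network lives on cuda:0,
# second half on cuda:1. While one GPU computes, the other sits idle.
class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
        self.part1 = nn.Linear(1024, 10).to("cuda:1")

    def forward(self, x):
        x = self.part0(x.to("cuda:0"))
        # The activation tensor has to cross from GPU 0 to GPU 1 here
        return self.part1(x.to("cuda:1"))
```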

I was wondering:

  • is it possible to transfer a tensor from gpu_0 to gpu_1 during training using torch's .to(device) (with 4090s)?
  • does it make sense to do that? By this I mean: will it still be efficient without NVLink, or is this a waste of time during training?

Also, I came across the term “P2P” on a forum about the 4090; can someone clarify what it means exactly?

As far as I understand, naive model parallelism makes the GPUs work one at a time. I was wondering whether other kinds of distributed training pipelines exist for the 4090, considering there is no NVLink?

Thank you !

I suspect you might get better help asking questions like this on a PyTorch forum, such as discuss.pytorch.org

Yes

I haven’t tried that syntax myself.

It will move from GPU to GPU roughly at the speed afforded by your PCIe connection. To make a comparison with NVLink, more specifics would be needed, such as the bandwidth of the NVLink you are imagining (since none exists for the 4090). But to a first-order approximation, the impact (at least for the data transfer) could be considered proportional to the bandwidth of PCIe on your proposed platform vs. the bandwidth of NVLink on your imagined platform.
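
If you want to see the number on your own platform, a rough sketch along these lines would measure it (the tensor size is arbitrary, and the timing is only approximate):

```python
import torch

# Move ~1 GiB of float32 data from GPU 0 to GPU 1 and time it.
# Throughput is bounded by the PCIe link between the two cards.
x = torch.randn(256, 1024, 1024, device="cuda:0")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x.to("cuda:1")
end.record()
torch.cuda.synchronize()

gib = x.numel() * x.element_size() / 2**30
ms = start.elapsed_time(end)
print(f"{gib:.2f} GiB in {ms:.1f} ms (~{gib / (ms / 1000):.1f} GiB/s)")
```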

Certainly, if you can arrange not to be moving tensors around (and I think PyTorch DDP would arrange for that, but I’m definitely not an expert), it’s probably more efficient than moving tensors around. Even DDP will have to move some data at certain points.
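
For reference, the DDP pattern looks roughly like this (a minimal sketch; the model and training loop are placeholders, launched with torchrun --nproc_per_node=2):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun sets RANK/WORLD_SIZE/MASTER_* env vars.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = nn.Linear(1024, 10).cuda(rank)   # placeholder model
    model = DDP(model, device_ids=[rank])    # each rank holds a full replica
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):                      # placeholder training loop
        x = torch.randn(32, 1024, device=f"cuda:{rank}")
        y = torch.randint(0, 10, (32,), device=f"cuda:{rank}")
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()                      # gradients are all-reduced here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```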

If you google for GPUDirect Peer to Peer you will find many discussions. At a basic level, it allows data to be transferred from one GPU to another with minimal interaction from the CPU, and if you had NVLink, using P2P would be necessary to take advantage of it.
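
You can also query whether the driver reports P2P access between two devices directly from PyTorch (a small sketch):

```python
import torch

# Ask whether direct GPU-to-GPU (P2P) access is reported between
# device 0 and device 1. Without P2P, inter-GPU copies are typically
# staged through host memory over PCIe.
if torch.cuda.device_count() >= 2:
    print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
    print("P2P 1 -> 0:", torch.cuda.can_device_access_peer(1, 0))
```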

This isn’t really a CUDA question, and there are multiple possibilities for organizing the work in distributed training. I believe PyTorch implements more than one kind of parallelism (data parallelism and pipeline parallelism, for example).

P2P is also not supported, at least for the 4090, so if this is important or a requirement, it needs to be considered.