Dual RTX 4090 with distributed training


We are planning to buy a workstation to train machine learning models for computer vision. Since we are a small company, we are considering a workstation with 2x RTX 4090 in it. I am a deep learning engineer, but so far I have only worked with a single GPU, so I have some questions about distributed training.

So far I have read that NVIDIA removed NVLink from the 40 series, which means, if I understand correctly, that data is transferred over PCIe. I have seen some tutorials on the internet about distributed training (naive model parallelism => dividing the model into two parts, each part on its own GPU).
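For concreteness, here is a sketch of the kind of naive model parallelism I mean (the layer sizes are arbitrary, and it falls back to CPU so it runs on any machine):

```python
import torch
import torch.nn as nn

# Pick two devices; fall back to CPU so the sketch runs anywhere.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0") if two_gpus else torch.device("cpu")
dev1 = torch.device("cuda:1") if two_gpus else torch.device("cpu")

# First half of the model on gpu_0, second half on gpu_1.
part1 = nn.Sequential(nn.Linear(64, 128), nn.ReLU()).to(dev0)
part2 = nn.Sequential(nn.Linear(128, 10)).to(dev1)

x = torch.randn(8, 64, device=dev0)
h = part1(x)     # runs on gpu_0
h = h.to(dev1)   # activation crosses PCIe here (no NVLink on 4090)
y = part2(h)     # runs on gpu_1
print(y.shape)   # torch.Size([8, 10])
```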

I was wondering :

  • Is it possible to transfer a tensor from gpu_0 to gpu_1 during training using torch .to(device) (with 4090s)?
  • Does it make sense to do that? By this I mean: will it still be efficient without NVLink, or is this a waste of time during training?

Also, I came across the term “P2P” on some forum about the 4090; can someone clarify what it means exactly?

As far as I understand, naive model parallelism makes the GPUs work one at a time. I was wondering whether other kinds of pipelined distributed training exist for the 4090, considering there is no NVLink?

Thank you !

I suspect you might get better help asking questions like this on a pytorch forum, such as discuss.pytorch.org


I haven’t tried that syntax myself.

It will move from GPU to GPU at roughly the speed afforded by your PCIe connection. To do a comparison with NVLink, more specifics would be needed, such as the bandwidth of the NVLink you are imagining (since none exists for the 4090). But to a first-order approximation, the impact (at least for the data transfer) could be considered proportional to the bandwidth of PCIe on your proposed platform vs. the bandwidth of NVLink on your imagined platform.
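As a sketch of the transfer itself (hedged, since I haven't tried it; it falls back to CPU when a second GPU is absent, and on a real dual-4090 box the copy would travel over PCIe):

```python
import torch

# Hypothetical two-GPU transfer; fall back to CPU if unavailable.
has_two = torch.cuda.device_count() >= 2
dev0 = "cuda:0" if has_two else "cpu"
dev1 = "cuda:1" if has_two else "cpu"

t = torch.randn(1024, 1024, device=dev0)
t = t.to(dev1)   # the copy travels over PCIe (no NVLink on 4090)
print(t.device)
```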

Certainly, if you can arrange not to be moving tensors around (and I think pytorch DDP would arrange for that, but I’m definitely not an expert), it’s probably more efficient than moving tensors around. Even DDP will have to move some data at certain points.
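For a rough idea of the DDP pattern: each rank holds a full model replica and only gradients are all-reduced during backward. This is a minimal single-process sketch using the gloo backend so it runs without GPUs; in real use you would launch one process per GPU (e.g. with torchrun) and use the nccl backend.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process sketch: gloo backend, world_size=1, CPU only.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(16, 4))   # each rank holds a full replica
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, target = torch.randn(8, 16), torch.randn(8, 4)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()   # gradients are all-reduced across ranks here
opt.step()

dist.destroy_process_group()
print("step ok")
```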

If you google for GPUDirect Peer-to-Peer you will find many discussions. At a basic level, it allows data transfer from one GPU to another with the smallest amount of interaction from the CPU, and if you had NVLink, using P2P would be necessary to take advantage of it.
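One way to query this on a given machine is pytorch's peer-access check (a sketch; the query is only meaningful when at least two CUDA devices are present, so it prints a note otherwise):

```python
import torch

# Ask whether device 0 can directly access device 1's memory (P2P).
if torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))
else:
    print("fewer than two GPUs; P2P query not applicable")
```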

This isn’t really a CUDA question, and there are multiple possibilities for work organization in distributed training. I believe pytorch implements more than one kind of parallelism.

P2P is also not supported, at least for the 4090, so if this is important or a requirement, it needs to be considered.