Standard NVIDIA CUDA tests fail with dual RTX 4090 Linux box

We can confirm that with driver 525.105.17 the locking problem is fixed, and UVA (Unified Virtual Addressing) also seems to work with 2x RTX 4090 without error on AMD EPYC and Threadripper. With working UVA, NCCL works as well → multi-GPU training with PyTorch and TensorFlow works.

Finally, 2x RTX 4090 can be used for deep learning training. The missing P2P support probably hinders scaling beyond two RTX 4090s because of bad all-gather and all-reduce performance… but at least now they are usable.
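
For anyone who wants to repeat the check, here is a minimal sketch (assuming at least two visible GPUs; everything else about the setup is an assumption on my part) that asks PyTorch whether peer access is reported between each pair of local devices:

```python
# Minimal sketch: query whether PyTorch reports peer (P2P) access between
# each pair of local GPUs. Assumes at least two CUDA devices are visible.
import torch

n = torch.cuda.device_count()
assert n >= 2, "needs at least two GPUs"

for src in range(n):
    for dst in range(n):
        if src == dst:
            continue
        ok = torch.cuda.can_device_access_peer(src, dst)
        print(f"GPU {src} -> GPU {dst}: peer access {'yes' if ok else 'no'}")
```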

This is great to hear!

So now the question is, how can I downgrade from 530 to 525 properly on ubuntu 20.04?

Or any idea when the change will make it into the 530 branch?

Hey, can you explain what you said? You said it’s usable, so why is the performance bad?

I said that it is usable (it does not crash or hang anymore) and that it calculates correct results (which was also not the case in older versions). I did not say that it is good or fast, and I already doubted that it makes sense to use more than two RTX 4090s. What are your experiences with the performance?

Can you explain why it’s not good?

The missing P2P support probably hinders scaling beyond two RTX 4090s because of bad all-gather and all-reduce performance.
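
To put a number on that, here is a rough micro-benchmark sketch that times NCCL all-reduce across all local GPUs; the tensor size, port, and iteration counts are arbitrary choices of mine, not from this thread:

```python
# Rough sketch: time NCCL all-reduce across all local GPUs to see how much
# collective bandwidth suffers without P2P. Sizes and ports are arbitrary.
import os
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    x = torch.ones(64 * 1024 * 1024, device="cuda")  # 256 MiB of fp32
    for _ in range(5):                               # warm-up
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    per_iter = (time.perf_counter() - start) / iters

    if rank == 0:
        gib = x.numel() * x.element_size() / 2**30
        print(f"all-reduce of {gib:.2f} GiB: {per_iter * 1e3:.1f} ms/iter")
    dist.destroy_process_group()

if __name__ == "__main__":
    n = torch.cuda.device_count()
    mp.spawn(worker, args=(n,), nprocs=n)
```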

The missing functionality makes it as slow as a CPU.

If you have the $ for 2x 4090, you should pick up 2x A5500 instead. The memory > the processing for inference.

[Before photo]

[After photo]

The After pic destroys the performance of 2x 4090. Not even close.

You need a completely un-hobbled GPU, including P2P, to do anything. Unless you are using multiple systems and building your own P2P, the 4090 is a total waste of time.

  • Note on the bracket: I changed cases later, and the bracket is no longer necessary; the old case needed it. The LianLi O11XL fits everything without issue.

Has there been any progress on this?

I can get 2x 4090s working via NCCL_P2P_DISABLE=1, but moving to 3x 4090s barely improves throughput over 2x 4090s.
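
For reference, NCCL_P2P_DISABLE is a documented NCCL environment variable; it just has to be set before the first NCCL communicator is created. A minimal sketch of doing that from inside a script (the surrounding training code is assumed, not shown):

```python
# Sketch: disable NCCL's P2P transport so collectives route through shared
# host memory instead. Must be set before the first NCCL communicator is
# created (easiest: before initializing anything CUDA-related).
import os
os.environ["NCCL_P2P_DISABLE"] = "1"  # documented NCCL environment variable

import torch.distributed as dist
# ... then initialize and train as usual, e.g.:
# dist.init_process_group("nccl", rank=rank, world_size=world_size)
```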

Hi,

This is really interesting. I can confirm that, running various training jobs, we are seeing the same issues described here.

One thing I noticed: when I train or fine-tune LLM models, they scale fine across multiple 4090 GPUs (we have tested with 4x GPUs). But when running a vision model like ResNet-50, the system crashes or loses performance. For example, the ResNet-50 training starts and after about 2 minutes performance drops by roughly 80%: the GPUs are at 100% utilization at the beginning, drawing the full power of the GPU, and then after 2 minutes power consumption falls to 120 W or less. This is not hardware-related, since we tested multiple systems and scenarios; it seems to be entirely related to P2P being locked on GeForce cards.

Can you please let us know, @abchauhan, whether this is permanent and what it means for multi-GPU workstations: does it mean we cannot train DL models using multiple 4090s and must buy Quadro GPUs instead?

Also, for 3D rendering on Windows 11, will P2P be enabled and unaffected there, or how does it work in that case?

Please confirm, @abchauhan; this is important for us to make the right business decision.

Thanks

@kinred How did you verify that UVA and NCCL are working?

By running the following tests:

They are not comprehensive, but they did not work at all when this thread was started.
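
For readers without the links, a minimal stand-in sketch of the kind of check involved (this is my own illustration, not one of the tests referenced above): round-trip a buffer between GPUs and verify nothing was corrupted, since broken UVA used to return wrong results without raising any error.

```python
# Minimal sanity-check sketch: round-trip a tensor GPU0 -> GPU1 -> GPU0 and
# verify the data survives (with broken UVA, such copies returned garbage
# silently rather than failing).
import torch

a = torch.randn(1 << 20, device="cuda:0")
b = a.to("cuda:1")        # device-to-device copy (P2P if available, else via host)
back = b.to("cuda:0")

assert torch.equal(a, back), "cross-GPU copy corrupted data"
print("cross-GPU round-trip OK")
```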

Thank you!

This is still affecting me on driver 545.29.06, kernel 6.7.0. I’m getting the same output as OP with simpleP2P.

Have there been any updates on this?
On my multi-4090 AMD system I can train with multiple GPUs, but when deploying an LLM across multiple GPUs it raises a device-side index assertion error. If I turn off all virtualization, the system hangs when attempting to infer from multiple GPUs.
I believe this is related to the issue in this thread, as another system of mine with multiple 3090s has no issues at all running the exact same code. I’m on Ubuntu 22.04, Python 3.10, CUDA 12.1. The P2P test seems to run without errors.
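
One generic way to localize a device-side assert like that (standard CUDA debugging practice, nothing specific to this thread) is to force synchronous kernel launches so the traceback points at the failing op:

```python
# Debugging sketch: CUDA_LAUNCH_BLOCKING makes kernel launches synchronous,
# so the device-side index assert surfaces at the offending call instead of
# at some later synchronization point. Set it before any CUDA work starts.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import after setting the env var
# ... load the model and run the multi-GPU inference as before; the Python
# traceback should now point at the kernel that trips the assertion.
```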

@kj140717 FWIW, it seems the latest drivers do correctly report P2P as disabled for the 4090, though I’ve since switched to tinygrad’s fork since it implements P2P.