Standard nVidia CUDA tests fail with dual RTX 4090 Linux box

We can confirm that with driver 525.105.17 the locking problem is fixed, and UVA (Unified Virtual Addressing) also seems to work with 2x RTX 4090 without error on AMD EPYC and Threadripper. With UVA working, NCCL works as well, so multi-GPU training with PyTorch and TensorFlow works.

Finally, 2x RTX 4090 can be used for deep learning training. The missing P2P performance probably hinders scaling beyond two RTX 4090s because of poor all-gather and all-reduce performance, but at least now they are usable.


This is great to hear!

So now the question is: how can I properly downgrade from 530 to 525 on Ubuntu 20.04?

Or any idea when the change will make it into the 530 branch?


Hey, can you explain what you said? You said it’s usable? Why is the performance bad then?

I said that it is usable (it no longer crashes or hangs) and that it calculates correct results (which was also not the case in older versions). I did not say that it is good or fast, and I already doubted that it makes sense to use more than two RTX 4090s. What is your experience with the performance?

Can you explain why it’s not good?

The missing P2P performance probably hinders scaling beyond two RTX 4090s because of poor all-gather and all-reduce performance.
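To make the scaling argument a bit more concrete, here is a rough back-of-envelope sketch in Python. It assumes a bandwidth-only ring all-reduce model, in which each GPU sends about 2(N-1)/N times the payload per all-reduce; latency and overlap with compute are ignored. The function name, the payload size, and both bandwidth figures are illustrative assumptions, not measurements from this thread.

```python
def ring_allreduce_seconds(num_gpus, payload_bytes, link_bytes_per_s):
    """Bandwidth-only estimate of one ring all-reduce (ignores latency)."""
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    # In a ring all-reduce each GPU sends 2 * (N - 1) / N times the payload.
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_per_gpu / link_bytes_per_s


if __name__ == "__main__":
    payload = 100e6 * 4  # hypothetical: 100M fp32 gradient values
    links = [
        ("hypothetical direct peer link", 25e9),   # assumed bytes/s
        ("hypothetical host-staged path", 5e9),    # assumed bytes/s
    ]
    for label, bandwidth in links:
        for n in (2, 3, 4):
            ms = ring_allreduce_seconds(n, payload, bandwidth) * 1e3
            print(f"{label}, {n} GPUs: {ms:.1f} ms per all-reduce")
```

In this model the per-GPU traffic rises from 1.0x the payload at two GPUs to 1.5x at four, so a slow inter-GPU path costs more as GPUs are added.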

The missing functionality makes it as slow as a CPU.

If you have the money for 2x 4090, you should pick up 2x A5500 instead. For inference, memory matters more than processing power.

[Image: Before]

[Image: After]

The After pic destroys the performance of 2x4090. Not even close.

You need a completely un-hobbled GPU, including p2p to do anything. Unless you are using multiple systems and building your own p2p, 4090 is a total waste of time.

  • Note on the bracket: I changed cases later, so the bracket is no longer necessary. The old case needed it. LianLi O11XL fits everything without issue.

Has there been any progress on this?

I can get 2x 4090s working via NCCL_P2P_DISABLE=1. But moving to 3x 4090s barely improves throughput over 2x 4090s.
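For reference, a minimal Python sketch of applying that workaround when launching a job. `NCCL_P2P_DISABLE` and `NCCL_DEBUG` are real NCCL environment variables; the helper names and the `train.py` script in the usage comment are hypothetical.

```python
import os
import subprocess


def nccl_no_p2p_env(base_env=None, debug=False):
    """Return a copy of the environment with NCCL peer-to-peer disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"   # tell NCCL not to use GPU peer-to-peer transport
    if debug:
        env["NCCL_DEBUG"] = "INFO"  # makes NCCL log which transports it selects
    return env


def run_with_p2p_disabled(cmd, debug=False):
    """Run a training command with the modified environment; return its exit code."""
    return subprocess.run(cmd, env=nccl_no_p2p_env(debug=debug)).returncode


# Hypothetical usage (train.py is a placeholder for your own script):
# run_with_p2p_disabled(["torchrun", "--nproc_per_node=2", "train.py"], debug=True)
```

With `debug=True`, the NCCL log output should show which transport is chosen between GPUs, which helps confirm the setting took effect.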


This is really interesting. I can confirm that we are seeing the same issues described here when running different training jobs.

One thing I noticed is that when I train or fine-tune LLMs, it scales fine across multiple 4090s (we have tested with 4 GPUs). But when running a vision model like ResNet-50, the system crashes or loses performance. For example, ResNet-50 training starts, and after 2 minutes performance drops by about 80%: the GPUs run at 100% utilization and full power at the beginning, and then after 2 minutes power consumption drops to 120 W or less. This is not related to hardware, since we tested multiple systems and scenarios; it all seems related to P2P being locked on GeForce cards.

@abchauhan, can you please let us know whether this is permanent and what it means for multi-GPU workstations? Does it mean we cannot train DL models on multiple 4090s and must buy Quadro GPUs?

Also, on Windows 11 for 3D rendering, will P2P be enabled so that rendering is unaffected, or how does it work in that case?

@abchauhan, please confirm; this is important for us to make the right business decision.
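One way to capture the drop described above is to log power draw and utilization while the job runs. The sketch below assumes `nvidia-smi` is on the PATH and relies on its `--query-gpu` CSV output; the function names are hypothetical, and the 120 W default threshold is simply the figure mentioned in this post, not a recommended value.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,power.draw,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse 'index, watts, util%' lines into (int, float, float) tuples."""
    stats = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        try:
            stats.append((int(parts[0]), float(parts[1]), float(parts[2])))
        except ValueError:
            continue  # skip rows where a field is not numeric, e.g. "[N/A]"
    return stats


def low_power_gpus(stats, max_watts=120.0):
    """Return indices of GPUs drawing max_watts or less."""
    return [idx for idx, watts, _util in stats if watts <= max_watts]


def sample():
    """Take one reading from nvidia-smi (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

Calling `sample()` every few seconds during training and recording `low_power_gpus(...)` alongside the throughput would show whether the power drop and the slowdown coincide.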


@kinred How did you verify that UVA and NCCL are working?

By running the following tests:

They are not comprehensive, but they did not work at all when this thread was started.

Thank you!

This is still affecting me on driver 545.29.06, kernel 6.7.0. I’m getting the same output as OP with simpleP2P.

Have there been any updates on this?
On my multi-4090 AMD system, I can train with multiple GPUs. But when deploying an LLM across multiple GPUs, it raises a device-side index assertion error. If I turn off all virtualization, the system hangs when attempting to run inference across multiple GPUs.
I believe this is related to the issue in this thread, as another system of mine with multiple 3090s has no issues at all running the exact same code. I’m on Ubuntu 22.04, Python 3.10, CUDA 12.1. The P2P test seems to run without errors.

@kj140717 FWIW it seems the latest drivers do correctly report p2p as disabled for 4090, though I’ve since switched to tinygrad’s fork since it implements p2p.
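For anyone who wants to see what their driver reports, here is a minimal sketch, assuming PyTorch with CUDA support is installed. `torch.cuda.can_device_access_peer` is an existing PyTorch call; the helper names are hypothetical, and the query function is passed in as a parameter so the matrix logic can be exercised without GPUs.

```python
def peer_access_matrix(num_devices, can_access):
    """Build an N x N matrix of reported peer access (diagonal is always False)."""
    return [
        [i != j and bool(can_access(i, j)) for j in range(num_devices)]
        for i in range(num_devices)
    ]


def format_matrix(matrix):
    """Render the matrix with Y/N cells and '-' on the diagonal."""
    lines = []
    for i, row in enumerate(matrix):
        cells = ["-" if i == j else ("Y" if ok else "N") for j, ok in enumerate(row)]
        lines.append(f"GPU{i}: " + " ".join(cells))
    return "\n".join(lines)


def report_from_torch():
    """Query the real devices; torch is imported lazily so the helpers work without it."""
    import torch

    n = torch.cuda.device_count()
    return format_matrix(peer_access_matrix(n, torch.cuda.can_device_access_peer))
```

Note that a "Y" only reflects what the driver reports; as this thread shows, the reported capability has not always matched actual behaviour.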