We can confirm that with driver 525.105.17 the locking problem is fixed, and UVA (Unified Virtual Addressing) also seems to work with 2x RTX 4090 without errors on AMD EPYC and Threadripper. With working UVA, NCCL works as well → multi-GPU training with PyTorch and TensorFlow works.
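For anyone who wants to check their own setup, here is a minimal sketch (assuming PyTorch with CUDA installed and at least two visible GPUs) that queries what the driver reports for peer access between each GPU pair:

```python
import torch

# Enumerate all GPU pairs and ask the driver whether peer (P2P) access
# is reported as possible. On GeForce 40-series cards this is expected
# to return False, since P2P is not supported on them.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'possible' if ok else 'not possible'}")
```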
Finally, 2x RTX 4090 can be used for deep learning training. The missing P2P probably hinders scaling beyond two RTX 4090s because of poor all-gather and all-reduce performance… but at least now they are usable.
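To put a number on the all-reduce bottleneck, a micro-benchmark along these lines could be launched with torchrun, one process per GPU on a single node (the message size and iteration counts are arbitrary choices, not from this thread):

```python
import time
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=2 allreduce_bench.py
# Measures rough NCCL all-reduce bandwidth; without P2P, traffic is
# staged through host memory, so effective bus bandwidth drops.
dist.init_process_group("nccl")
rank = dist.get_rank()          # equals the local GPU index on one node
torch.cuda.set_device(rank)

numel = 256 * 1024 * 1024 // 4  # ~256 MB of float32
x = torch.ones(numel, device="cuda")

for _ in range(5):              # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
dt = (time.time() - t0) / iters

gb = x.numel() * x.element_size() / 1e9
n = dist.get_world_size()
if rank == 0:
    # Ring all-reduce bus bandwidth: 2*(n-1)/n * size / time
    print(f"all_reduce {gb:.2f} GB in {dt*1000:.1f} ms "
          f"(~{2 * gb * (n - 1) / n / dt:.1f} GB/s bus bandwidth)")
dist.destroy_process_group()
```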
I said that it is usable (it no longer crashes or hangs) and that it calculates correct results (which was also not the case with older versions). I did not say that it is good or fast, and I already doubted that it makes sense to use more than two RTX 4090s. What are your experiences with the performance?
The "after" pic destroys the 2x 4090's performance. Not even close.
You need a completely un-hobbled GPU, including P2P, to do anything. Unless you are using multiple systems and building your own P2P, the 4090 is a total waste of time.
Note on the bracket: I changed cases later, and the bracket is no longer necessary. The old case needed it. The LianLi O11XL fits everything without issue.
This is really interesting. I can confirm that, running different training jobs, we are seeing the same issues described here.
One thing I noticed: when I train or fine-tune LLM models, they scale fine across multiple 4090 GPUs (we have tested with 4x GPUs), but when running a vision model like resnet50, the system crashes or loses performance. For example: the resnet50 training starts, and after about 2 minutes performance drops by roughly 80%. The GPUs run at 100% utilization at the beginning, drawing their full power budget, and then after 2 minutes power consumption falls to 120 W or less. This is not hardware-related, since we have tested multiple systems and scenarios; it seems to be entirely related to P2P being locked on GeForce cards.
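A small sketch for catching that drop, assuming the pynvml package is installed (the 2-second sampling interval is an arbitrary choice): log power draw and utilization per GPU while the training job runs, and see when the wattage collapses:

```python
import time
import pynvml

# Poll power draw and utilization of every GPU so the moment the
# resnet50 run degrades shows up in the log. Run alongside training.
pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0  # mW -> W
            util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu    # percent
            print(f"GPU {i}: {power_w:6.1f} W, {util:3d}% util")
        time.sleep(2)
finally:
    pynvml.nvmlShutdown()
```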
@abchauhan, can you please let us know if this is permanent and what it means for workstations with multiple GPUs? Does it mean we cannot train DL models using multiple 4090s and must buy Quadro GPUs instead?
Also, on Windows 11, will P2P be enabled for 3D rendering and therefore unaffected, or how does it work in that case?
@abchauhan, please confirm; this is important for us to make the right business decision.
Have there been any updates on this?
On my multi-4090 AMD system, I can train with multiple GPUs. But when deploying an LLM across multiple GPUs, it raises a device-side index assertion error. If I turn off all virtualization options, the system hangs when attempting to infer on multiple GPUs.
I believe this is related to the issue in this thread, as another system of mine with multiple 3090s has no issues at all running the exact same code. I’m on Ubuntu 22.04, Python 3.10, CUDA 12.1. The P2P test seems to run without errors.
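Independent of what the P2P test reports, one way to double-check that data actually arrives intact across GPUs is a small sketch like this (assuming PyTorch and two visible GPUs; the tensor size is arbitrary):

```python
import torch

# Copy a tensor from GPU 0 to GPU 1 and back, then compare with the
# original on the host. If broken P2P corrupts transfers, the values
# will not match; with P2P disabled but working fallbacks, they will.
src = torch.arange(1_000_000, device="cuda:0", dtype=torch.int64)
on_gpu1 = src.to("cuda:1")
round_trip = on_gpu1.to("cuda:0")

assert torch.equal(src.cpu(), on_gpu1.cpu()), "GPU0 -> GPU1 copy corrupted"
assert torch.equal(src, round_trip), "round-trip copy corrupted"
print("cross-GPU copies verified OK")
```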
@kj140717 FWIW, it seems the latest drivers do correctly report P2P as disabled for the 4090, though I’ve since switched to tinygrad’s driver fork, since it implements P2P.