Indeed, the more important problem is that UVA on the RTX 4090 is not working correctly, which is actually the part where the simpleP2P test failed.
UVA not working is a big problem for an application I'm working on too. After debugging it for several days, I posted about it just a few days ago, when I tracked it down to UVA failing in a simple test program.
I tested it and it works on Windows 11 with the 528 driver, but I'm running into other problems with that. I'm keeping my fingers crossed that the next Linux driver release will resolve the problem. It would be great if we could get some feedback from Nvidia about what to expect on that, though.
I have requested additional feedback. (Bug 3931150 for internal tracking.)
I am setting up a system to reproduce this. I'll file a new bug if this remains an open issue. Thanks.
I can confirm that dual 4090s also work on Windows 10 with PyTorch (latest driver 528.49, CUDA 12.0). However, it freezes on Ubuntu 22.04 (latest driver 525.89.02, CUDA 12.0). So this is not a hardware limitation, but a driver/software issue. Please fix asap. Thank you!
Hello @abchauhan, do you have a target date for a fix of the issue? There are many modules and packages out there that leverage DDP and are currently broken for multi-GPU use. Thank you!
Hi, @abchauhan
I've upgraded to the latest CUDA 12.1 with driver 530.30.02.
simpleP2P still fails, while its peer access check reports P2P as YES.
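For readers who want to see what the driver reports on their own machine, here is a minimal Python sketch that builds a peer access matrix. It assumes PyTorch is installed and uses `torch.cuda.can_device_access_peer`; the helper name `peer_access_matrix` is made up for illustration. Note that this only shows what the driver claims. As the post above describes, the check can report YES while actual transfers still fail, so this is not a correctness test.

```python
from typing import Callable, List


def peer_access_matrix(num_devices: int,
                       can_access: Callable[[int, int], bool]) -> List[List[bool]]:
    """Return matrix[i][j] == True if device i reports peer access to device j.

    `can_access` is injected so the helper can be exercised without a GPU.
    The diagonal is always False (a device is not its own peer).
    """
    return [
        [i != j and bool(can_access(i, j)) for j in range(num_devices)]
        for i in range(num_devices)
    ]


if __name__ == "__main__":
    try:
        import torch  # assumption: PyTorch with CUDA support is installed
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        n = torch.cuda.device_count()
        matrix = peer_access_matrix(n, torch.cuda.can_device_access_peer)
        for i, row in enumerate(matrix):
            # One line per source device, YES/NO per target device.
            print(f"GPU{i}:", " ".join("YES" if ok else "NO " for ok in row))
    else:
        print("PyTorch with CUDA is not available; nothing to query.")
```

Injecting the query function keeps the helper independent of the hardware, so the same code can be checked on a machine without any GPU.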
Hi,
Peer-to-peer is disabled for all GeForce Ada Desktop cards. There were driver issues related to Peer-to-peer which were recently fixed for these cards.
The release candidate is being finalized. I will share the target dates and the driver versions which include the fix as soon as that information is available. Unfortunately, the latest Beta version, 530.30.02, does not include the fix.
I can reproduce the hang with this application as well as with the examples in "NCCL all_reduce_perf test hangs with multiple RTX 4090 GPUs, works fine when I swap in 2080tis" (Issue #117, NVIDIA/nccl-tests, GitHub).
I have verified that there are no hangs on drivers with the fix.
Thanks
It turns out that on Windows 10 the P2P is disabled by default so pytorch did not freeze.
Hi all,
To clarify, peer-to-peer is disabled on all GeForce Ada Desktop cards and the fix will resolve the application hangs, crashes and incorrect answers.
The fix doesn't enable peer-to-peer on these cards. Peer-to-peer on these cards is an unsupported configuration.
Thank you
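Since peer-to-peer is described above as unsupported on these cards, one thing readers sometimes try is telling NCCL not to use it at all. The sketch below builds a process environment with `NCCL_P2P_DISABLE=1`, which is a documented NCCL environment variable. Whether this avoids the hangs on the affected drivers is an assumption and is not confirmed anywhere in this thread; the helper name `nccl_env_without_p2p` is made up for illustration.

```python
import os
from typing import Dict, Mapping, Optional


def nccl_env_without_p2p(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with NCCL peer-to-peer transport disabled.

    Uses the current process environment when `base_env` is None.
    The input mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Documented NCCL variable; its effect on the hang discussed here is unverified.
    env["NCCL_P2P_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    env = nccl_env_without_p2p()
    print("NCCL_P2P_DISABLE =", env["NCCL_P2P_DISABLE"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a multi-GPU training job, so the setting applies only to that job and not to the whole shell session.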
Is there an estimate of when the fix will be available?
Hi @suprnerd
The driver release is scheduled for the end of this month (March 2023). I will share the release version information when it's available.
Thank you
Thanks for the update @abchauhan
I understand that P2P is not and won't be supported on Ada GeForce cards. Do you know if the problems with Unified Memory will be fixed with this release? Or, I guess, whether that is even supported? It works for me with two 4080s on Windows and I don't see anything in the documentation to suggest it shouldn't work. I'm assuming it should be supported, but someone correct me if I'm wrong.
Indeed, it would be really appreciated if NVIDIA communicated whether multi-GPU setups with GeForce GPUs are officially supported or not.
This situation, in which a mysterious bug makes GPUs stall or calculate wrong values and we don't know whether it is solvable, is unprofessional.
In the meantime, can you make sure that this thread doesn't automatically close after 14 days? Thanks!
Do you know if the problems with Unified memory will be fixed with this release?
Can you please share the link to the forum post for this? Is there an NVIDIA bug number?
Indeed, it would be really appreciated if NVIDIA communicated whether multi-GPU setups with GeForce GPUs are officially supported or not.
I'll share this feedback with the teams.
Thank you