Standard nVidia CUDA tests fail with dual RTX 4090 Linux box

Indeed, the more important problem is that UVA on the RTX 4090 is not working correctly, which is actually the part where the simpleP2P test failed.


UVA not working is a big problem for an application I’m working on too. After debugging it for several days, I tracked it down to UVA failing in a simple test program and posted about it a few days ago.

I tested it and it works on Windows 11 with the 528 driver, but I’m running into other problems with that. I’m keeping my fingers crossed that the next Linux driver release will resolve the problem. It would be great if we could get some feedback from Nvidia about what to expect on that though.


I have requested additional feedback (bug 3931150, for internal tracking).

I am setting up a system to reproduce this. I’ll file a new bug if this remains an open issue. Thanks.


I can confirm that dual 4090s also work on Windows 10 with PyTorch (latest driver 528.49, CUDA 12.0). On Ubuntu 22.04 (latest driver 525.89.02, CUDA 12.0), however, it freezes. So this is not a hardware limitation but a driver/software issue. Please fix ASAP. Thank you!


Hello @abchauhan, do you have a target date for a fix for this issue? There are many modules and packages out there that leverage DDP and are currently broken for multi-GPU use. Thank you!

Hi, @abchauhan
I’ve upgraded to the latest CUDA 12.1 with driver 530.30.02.
simpleP2P still fails, even though the P2P access check reports YES.
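
For anyone who wants to check this from PyTorch rather than from the CUDA samples, here is a minimal sketch. It is not the simpleP2P test itself: it only asks the driver what it reports for peer access and then verifies one device-to-device copy on the host. The function names `peer_access_report` and `cross_gpu_copy_matches` are made up for this example, and on an affected driver the copy may hang instead of returning, so run it in a process you can kill.

```python
def peer_access_report():
    """Map each ordered GPU pair (i, j) to what the driver reports for peer access.

    Returns {} when PyTorch or CUDA is not available.
    """
    try:
        import torch
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {}
    count = torch.cuda.device_count()
    return {
        (i, j): torch.cuda.can_device_access_peer(i, j)
        for i in range(count)
        for j in range(count)
        if i != j
    }


def cross_gpu_copy_matches(src=0, dst=1, numel=1 << 20):
    """Copy a known tensor from GPU `src` to GPU `dst` and compare it on the host.

    Returns None when PyTorch is missing or the requested GPUs are not present.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available() or torch.cuda.device_count() <= max(src, dst):
        return None
    expected = torch.arange(numel, dtype=torch.float32)  # reference copy on the host
    on_src = expected.to(f"cuda:{src}")
    on_dst = on_src.to(f"cuda:{dst}")  # device-to-device copy under test
    return torch.equal(on_dst.cpu(), expected)


if __name__ == "__main__":
    print("peer access reported by driver:", peer_access_report())
    print("cross-GPU copy matches:", cross_gpu_copy_matches())
```

A report of True for a pair combined with a mismatching (or hanging) copy would be consistent with what is described above, but a passing run does not prove the driver is fine.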

Hi,

Peer-to-peer is disabled for all GeForce Ada Desktop cards. There were driver issues related to Peer-to-peer which were recently fixed for these cards.

The release candidate is being finalized. I will share the target dates and the driver versions that include the fix as soon as that information is available. Unfortunately, the latest beta version, 530.30.02, does not include the fix.

I can reproduce the hang with this application as well as with the examples in NVIDIA/nccl-tests GitHub issue #117, "NCCL all_reduce_perf test hangs with multiple RTX 4090 GPUs, works fine when I swap in 2080tis".
I have verified that there are no hangs on drivers with the fix.

Thanks
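
Since the fix is tied to a driver version that has not been announced yet, a small helper for comparing the installed driver against a given version may be handy once it is. This is only a sketch: the helper names are invented here, the `first_fixed` value is deliberately left as a parameter because the thread does not state it, and it assumes `nvidia-smi` is on the PATH.

```python
import subprocess


def parse_driver_version(text):
    """Turn a string such as '530.30.02' into (530, 30, 2) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def installed_driver_version():
    """Ask nvidia-smi for the driver version; return None if the tool is missing or fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    lines = out.strip().splitlines()
    # One line per GPU; all GPUs share the same driver, so the first line is enough.
    return parse_driver_version(lines[0]) if lines else None


def includes_fix(installed, first_fixed):
    """True if `installed` is known and is at least `first_fixed` (both version tuples)."""
    return installed is not None and installed >= first_fixed
```

For example, `includes_fix(installed_driver_version(), parse_driver_version(announced))` would do the check, where `announced` is whatever version string NVIDIA eventually posts. A plain numeric comparison is an assumption too, since a fix may land on more than one driver branch.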


It turns out that on Windows 10 P2P is disabled by default, which is why PyTorch did not freeze.
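
If the absence of P2P is what keeps PyTorch from freezing on Windows, one thing to try on Linux is telling NCCL not to use its P2P transport through the documented `NCCL_P2P_DISABLE` environment variable. Nobody from NVIDIA has suggested this in the thread, and whether it avoids the hang on the affected drivers is an assumption. The sketch below only builds the environment for a child process; `nccl_env_without_p2p` and `train.py` are hypothetical names.

```python
import os


def nccl_env_without_p2p(base_env=None):
    """Return a copy of the environment with NCCL's P2P transport disabled.

    The caller's environment is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"  # NCCL then falls back to other transports
    return env


if __name__ == "__main__":
    import subprocess

    # Hypothetical launch of a two-GPU DDP script; adjust the command to your setup.
    subprocess.run(
        ["torchrun", "--nproc_per_node=2", "train.py"],
        env=nccl_env_without_p2p(),
        check=False,
    )
```

The variable has to be set before NCCL is initialized, which is why it is passed to the launcher rather than set inside the training loop. Expect lower inter-GPU bandwidth with P2P off.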


Hi all,

To clarify, peer-to-peer is disabled on all GeForce Ada Desktop cards and the fix will resolve the application hangs, crashes and incorrect answers.
The fix doesn’t enable peer-to-peer on these cards. Peer-to-peer on these cards is an unsupported configuration.

Thank you


Is there an estimate of when the fix will be available?

Hi @suprnerd

The driver release is scheduled for the end of this month (March 2023). I will share the release version information when it’s available.

Thank you


Thanks for the update @abchauhan

I understand that P2P is not and won’t be supported on Ada GeForce cards. Do you know if the problems with Unified Memory will be fixed with this release, or whether it is even supported? It works for me with two 4080s on Windows, and I don’t see anything in the documentation to suggest it shouldn’t work. I’m assuming it should be supported, but someone correct me if I’m wrong.


Indeed, it would be really appreciated if NVIDIA communicated whether multi-GPU setups with GeForce GPUs are officially supported or not.
The current situation, in which a mysterious bug has been introduced that makes GPUs stall or calculate wrong values and we don’t know whether it is solvable, is unprofessional.


In the meantime can you make sure that this thread doesn’t automatically close after 14 days? Thanks!


Do you know if the problems with Unified memory will be fixed with this release?

Can you please share the link to the forum post about this? Is there an NVIDIA bug number?

Indeed, it would be really appreciated if NVIDIA communicates if multi-GPU setups with Geforce GPUs are officially supported or not.

I’ll share this feedback with the teams.

Thank you