Standard NVIDIA CUDA tests fail with dual RTX 4090 Linux box

Indeed, the more important problem is that UVA on RTX 4090 is not working correctly. Which is actually the part where the simpleP2P test failed.
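For context, the failing check in simpleP2P boils down to the pattern below. This is a minimal sketch (not the actual sample code): it queries peer capability between device 0 and device 1, enables peer access, does a peer-to-peer copy, and verifies the data on the host. On the affected drivers, the capability query can report YES while the verification step fails.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err));                        \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

int main() {
    // Query the capability in both directions, as simpleP2P does.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    printf("Peer access 0->1: %s, 1->0: %s\n",
           can01 ? "YES" : "NO", can10 ? "YES" : "NO");
    if (!can01 || !can10) return 0;

    const size_t bytes = (1 << 20) * sizeof(float);
    float *d0, *d1;
    unsigned char *host = (unsigned char *)malloc(bytes);

    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaMalloc(&d0, bytes));
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));
    CHECK(cudaMalloc(&d1, bytes));

    // Fill d0 on device 0, copy d0 -> d1 peer-to-peer, read d1 back.
    CHECK(cudaSetDevice(0));
    CHECK(cudaMemset(d0, 0x5A, bytes));
    CHECK(cudaMemcpyPeer(d1, 1, d0, 0, bytes));
    CHECK(cudaMemcpy(host, d1, bytes, cudaMemcpyDeviceToHost));

    // Verify every byte survived the peer copy.
    for (size_t i = 0; i < bytes; ++i)
        if (host[i] != 0x5A) {
            printf("MISMATCH at byte %zu\n", i);
            return 1;
        }
    printf("Peer copy verified OK\n");
    free(host);
    return 0;
}
```

Build with `nvcc` and run on the dual-GPU box; the interesting case in this thread is the capability query saying YES while the byte-for-byte verification fails.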

The UVA not working is a big problem for an application I’m working on too. After debugging it for several days, I posted about it just a few days ago, when I tracked it down to UVA failing in a simple test program.

I tested it and it works on Windows 11 with the 528 driver, but I’m running into other problems with that. I’m keeping my fingers crossed that the next Linux driver release will resolve the problem. It would be great if we could get some feedback from Nvidia about what to expect on that though.

I have requested additional feedback. (Bug 3931150 for internal tracking.)

I am setting up a system to reproduce this. I’ll file a new bug if this remains an open issue. Thanks.

I can confirm that dual 4090s work on Windows 10 too with PyTorch (latest driver 528.49, CUDA 12.0); however, it freezes on Ubuntu 22.04 (latest driver 525.89.02, CUDA 12.0). So this is not a hardware limitation but a driver/software issue. Please fix ASAP. Thank you!

Hello @abchauhan, do you have a target date for a fix of the issue? There are many modules and packages out there that leverage DDP and are currently broken for multi-GPU use. Thank you!

Hi, @abchauhan
I’ve upgraded to the latest CUDA 12.1 with driver 530.30.02.
simpleP2P still fails, even though it reports that the P2P capability check is YES.

Hi,

Peer-to-peer is disabled for all GeForce Ada Desktop cards. There were driver issues related to Peer-to-peer which were recently fixed for these cards.

The release candidate is being finalized. I will share the target dates and driver versions that include the fix as soon as that information is available. Unfortunately, the latest beta version, 530.30.02, does not include the fix.

I can reproduce the hang with this application as well as with the examples at https://github.com/NVIDIA/nccl-tests/issues/117.
I have verified that there are no hangs on drivers with the fix.

Thanks

It turns out that on Windows 10, P2P is disabled by default, which is why PyTorch did not freeze there.

Hi all,

To clarify, peer-to-peer is disabled on all GeForce Ada Desktop cards and the fix will resolve the application hangs, crashes and incorrect answers.
The fix doesn’t enable peer-to-peer on these cards. Peer-to-peer on these cards is an unsupported configuration.

Thank you

is there an estimate of when the fix will be available?

Hi @suprnerd

The driver release is scheduled for the end of this month (March 2023). I will share the release version information when it’s available.

Thank you

Thanks for the update @abchauhan

I understand that P2P is not and won’t be supported on Ada GeForce cards. Do you know if the problems with unified memory will be fixed with this release? Or, I guess, is that even supported? It works for me with two 4080s on Windows, and I don’t see anything in the documentation to suggest it shouldn’t work. I’m assuming it should be supported, but someone correct me if I’m wrong.
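For reference, the kind of unified-memory test in question is roughly the sketch below. It is a hypothetical minimal example, not the poster's actual program: a `cudaMallocManaged` buffer is incremented by a kernel on each GPU in turn and then checked on the host, which exercises managed memory migrating between two devices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *data;
    // Managed memory is addressable from both GPUs and the host.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = 0;

    // Run the kernel once on each GPU against the same managed buffer.
    cudaSetDevice(0);
    addOne<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    addOne<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    // Each element should have been incremented once per GPU.
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] != 2) ++bad;
    printf("%s (%d bad elements)\n", bad ? "FAIL" : "PASS", bad);
    cudaFree(data);
    return 0;
}
```

On a working driver this prints PASS; on the broken configurations described in this thread, tests of this shape reportedly hang or return wrong values.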

Indeed, it would be really appreciated if NVIDIA communicated whether multi-GPU setups with GeForce GPUs are officially supported or not.
This “we have introduced a mysterious bug that makes GPUs stall or calculate wrong values, and we don’t know if it is solvable” situation is unprofessional.

In the meantime can you make sure that this thread doesn’t automatically close after 14 days? Thanks!

Do you know if the problems with Unified memory will be fixed with this release?

Can you please share the forum post’s link for this? Is there a NVIDIA bug number?

Indeed, it would be really appreciated if NVIDIA communicated whether multi-GPU setups with GeForce GPUs are officially supported or not.

I’ll share this feedback with the teams.

Thank you

I take it that the new 530.41.03 driver doesn’t solve this problem yet?

From the tone of the discussion it seems 4090s will have P2P locked. In my case, it renders the 4090 worthless.

I ended up having an internal debate between an A100 40G and 2x A5500 24G, and opted for the two A5500s. The price is in the ballpark of a pair of 4090s, and it packs the performance of the A100 40G that I have a feeling we both may have been aiming for.

A consideration if you are able to make the hardware switch.

For HPC we’re stuck with using appropriate hardware. Not necessarily a bad thing. Tools tend to be more functional than toys.

An A5500 is nearly as fast as an RTX 3090; for 4090 performance one has to go for an RTX 6000 Ada.

That’s nice and all, but it’s useless. 24G of RAM is nothing. The A5500 has P2P/SLI; the 3090 has neither.

With the poor capabilities of the 4090, it’s worthless as anything but a toy. I returned two of them. At least I can use 100% of an A5500.

Sure, it may be faster, but if it’s all blocked from use, what good does speed at playing games do for training and inference?

It did nothing for me: it gave me gibberish and unreliable results. It seems a few of us in this thread have had similar experiences.

Hi all,

Driver version 525.105.17 has the changes that resolve the application hangs and crashes. CUDA sample tests will report that P2P is not supported.
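A quick way to confirm the new driver's behavior is to query the peer capability for every device pair; per the fix described above, GeForce Ada desktop cards should now report the capability as not supported up front instead of hanging later. A small sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    // Check every ordered pair of GPUs in the system.
    for (int a = 0; a < count; ++a)
        for (int b = 0; b < count; ++b) {
            if (a == b) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, a, b);
            printf("Peer access GPU%d -> GPU%d: %s\n",
                   a, b, can ? "supported" : "not supported");
        }
    return 0;
}
```

On dual 4090s with driver 525.105.17 or later, every pair is expected to print "not supported", matching what the CUDA sample tests report.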

Please let us know if you continue to see any failures.

The current 530.xx driver - 530.41.03 does not include the changes. The next releases from the 530 branch should pick up the changes.

Thank you
