Standard NVIDIA CUDA tests fail on a dual RTX 4090 Linux box

Hello,

We found that some standard NVIDIA tests fail on a dual NVIDIA RTX 4090 system.

This is CUDA 11.8 with driver 520.61.05, running on Linux dl 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux.

TEST-1:
executables_v2/bin/x86_64/linux/release/simpleP2P
[executables_v2/bin/x86_64/linux/release/simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access…

Peer access from NVIDIA GeForce RTX 4090 (GPU0) → NVIDIA GeForce RTX 4090 (GPU1) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU1) → NVIDIA GeForce RTX 4090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…

Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 25.19GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access…
Shutting down…
Test failed!

TEST-2:
executables_v2/bin/x86_64/linux/release/OrderedAllocationIPC
Step 0 done
Step 1 done
Process 0: verifying…
Process 0: Verification mismatch at 0: 0 != 1
Process 0: Verification mismatch at 1: 0 != 1
Process 0: Verification mismatch at 2: 0 != 1
Process 0: Verification mismatch at 3: 0 != 1
Process 0: Verification mismatch at 4: 0 != 1
Process 0: Verification mismatch at 5: 0 != 1
Process 0: Verification mismatch at 6: 0 != 1
.
.
.
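
For context, the part of simpleP2P that fails boils down to roughly the following (my own stripped-down sketch, not the actual sample source; error checking omitted for brevity):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// A kernel on one GPU reads directly from a peer GPU's buffer over P2P
// and writes doubled values locally -- the access pattern simpleP2P checks.
__global__ void doubleFromPeer(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = 2.0f * src[i];
}

int main()
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("peer 0->1: %d, peer 1->0: %d\n", can01, can10);
    if (!can01 || !can10) return 0;

    const int n = 1 << 20;                        // 4 MB of floats
    std::vector<float> h(n), out(n);
    for (int i = 0; i < n; ++i) h[i] = float(i);

    float *d0, *d1;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&d0, n * sizeof(float));
    cudaMemcpy(d0, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&d1, n * sizeof(float));

    // Device 1 reads device 0's buffer directly -- the step that fails here.
    doubleFromPeer<<<(n + 255) / 256, 256>>>(d0, d1, n);
    cudaDeviceSynchronize();

    cudaMemcpy(out.data(), d1, n * sizeof(float), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int i = 0; i < n && bad < 5; ++i) {
        if (out[i] != 2.0f * h[i]) {
            printf("Verification error @ element %d: val = %f, ref = %f\n",
                   i, out[i], 2.0f * h[i]);
            ++bad;
        }
    }
    printf(bad ? "Test failed!\n" : "Test passed.\n");
    return bad != 0;
}

On our box it is this final verification step that produces the NaN values shown above.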

Here is system info:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
CPU family: 25
Model: 8
Thread(s) per core: 1
Core(s) per socket: 32


Hi Vasilii,

Apologies for the delay. Could you please capture an NVIDIA bug report from your system?
Please run nvidia-bug-report.sh as root (or via sudo) and attach the generated nvidia-bug-report.log.gz.
Could you also provide the make/model of the motherboard and the system?

Thank you

Hi @vasilii.shelkov,

Could you please check whether you see the same failures with the IOMMU disabled?

On Ubuntu, please edit /etc/default/grub and append amd_iommu=off to the options in the line starting with GRUB_CMDLINE_LINUX=.... Save the file, run update-grub2 as root, and reboot.
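
In case the exact edit is unclear, the result would look something like this (illustrative; keep whatever options are already present in your file and add amd_iommu=off to them):

GRUB_CMDLINE_LINUX="amd_iommu=off"

Then, as root:

update-grub2
reboot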

Thank you

Did it work and scale with amd_iommu turned off?

With amd_iommu off we get the same errors, except the “nan” values are now replaced by 0s:

bizon@dl:~$ ~/kuklin/cuda-samples-12/cuda-samples/bin/x86_64/linux/release/simpleP2P
[/home/bizon/kuklin/cuda-samples-12/cuda-samples/bin/x86_64/linux/release/simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access…

Peer access from NVIDIA GeForce RTX 4090 (GPU0) → NVIDIA GeForce RTX 4090 (GPU1) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU1) → NVIDIA GeForce RTX 4090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 25.14GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.000000
Verification error @ element 3: val = 0.000000, ref = 12.000000
Verification error @ element 4: val = 0.000000, ref = 16.000000
Verification error @ element 5: val = 0.000000, ref = 20.000000
Verification error @ element 6: val = 0.000000, ref = 24.000000
Verification error @ element 7: val = 0.000000, ref = 28.000000
Verification error @ element 8: val = 0.000000, ref = 32.000000
Verification error @ element 9: val = 0.000000, ref = 36.000000
Verification error @ element 10: val = 0.000000, ref = 40.000000
Verification error @ element 11: val = 0.000000, ref = 44.000000
Verification error @ element 12: val = 0.000000, ref = 48.000000
Disabling peer access…
Shutting down…
Test failed!

Here is the log: nvidia-bug-report.log.gz - Google Drive

I noticed that your system is from Bizon. Shouldn’t they be able to help, since they configured your system? For that matter, don’t they test systems for parallel training/GPU computing before shipping them?


It looks like an NVIDIA bug, so it is unlikely that Bizon or their testing could help here. This is the ticket number:

“NVIDIA PSIRT” PSIRT@nvidia.com; Bug report: standard nVidia P2P tests: 3902559


Hi vasilii.shelkov,

Thank you for the additional information. We can reproduce this issue on our systems. This is under investigation.


Hello @abchauhan – How do I receive updates on this issue? I tried emailing PSIRT@nvidia.com about issue 3902559, but haven’t heard back yet. Is there a portal I can sign in to and follow the update history?

Thanks

@abchauhan This is a very serious issue and has already been reproduced by a number of people: Problems With RTX4090 MultiGPU and AMD vs Intel vs RTX6000Ada or RTX3090 | Puget Systems

It seems the new A6000 Ada cards are also affected on AMD CPUs. I have ordered 8 such cards, so I hope this will be fixed soon.

I was able to reproduce this with my 4090s too, on AMD EPYC. The only thing that prevents the hang is setting NCCL_P2P_DISABLE=1, but the resulting performance is subpar.
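
For anyone trying the same workaround: it is just an environment variable set on the launching process, e.g.

NCCL_P2P_DISABLE=1 python train.py

(train.py is a stand-in here for whatever multi-GPU training script you run.)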

We cannot confirm that the RTX 6000 Ada GPUs have this problem on AMD EPYC or WRX80 based systems. P2P copy does not have to be disabled when using RTX 6000 Ada GPUs.

More importantly, the transferred data is correct. Multi RTX 6000 Ada setups seem to work without problems.

Some findings for the multi RTX 4090 setups:

  • When disabling P2P copy with NCCL_P2P_DISABLE on AMD EPYC/WRX80, the locking problem can be bypassed, but then the data transferred between the GPUs is not copied correctly (destination data is all 0 or all NaN). This can be tested, for example, with the sketch after this list.
  • The multi-GPU RTX 4090 problem is not specific to AMD CPUs. On the Intel CPUs we tested (for example a Xeon Silver 4309Y), the transfer is not blocked, but the data is likewise not copied correctly (destination all 0 or NaN). This holds whether or not NCCL_P2P_DISABLE is set, which of course should have no effect, since the example below uses CUDA directly and not the higher-level NCCL library.
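
A minimal sketch of such a test in plain CUDA (our illustration, not a vendor sample; devices 0 and 1 are assumed to be the two 4090s, and error checking is omitted for brevity):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const size_t n = 16 << 20;                    // 16M floats = 64 MB, as in simpleP2P
    std::vector<float> h(n), out(n);
    for (size_t i = 0; i < n; ++i) h[i] = float(i);

    // Enable peer access in both directions so the copy below can go over
    // P2P instead of being staged through host memory.
    float *d0, *d1;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&d0, n * sizeof(float));
    cudaMemcpy(d0, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&d1, n * sizeof(float));

    // Plain UVA copy: with unified addressing the runtime resolves which
    // device each pointer belongs to and performs a device-to-device copy.
    cudaMemcpy(d1, d0, n * sizeof(float), cudaMemcpyDefault);
    cudaDeviceSynchronize();

    cudaMemcpy(out.data(), d1, n * sizeof(float), cudaMemcpyDeviceToHost);

    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (out[i] != h[i]) ++bad;
    printf("%zu mismatches out of %zu elements\n", bad, n);
    return bad != 0;
}

On the affected 4090 setups, this is where the destination buffer comes back all 0 or all NaN.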

The RTX 4090 is currently not usable for multi-GPU work, on either Intel or AMD. Based on our analysis, the cause appears to be a broken(?) CUDA UVA implementation.


Is this a hardware or software/firmware issue?


I have just run into this issue as well, on a 2x 4090 setup with an i9-10980XE CPU. I too have been using the simpleP2P test, which is broken when P2P is enabled. Doing a bit of hacking, it does appear that cudaMemcpyPeer works as expected. In my case cudaMemcpyPeer gave 10.54 GB/s, versus 12.5 GB/s when P2P access is enabled.

@abchauhan Any idea when we should expect a fix? I can also confirm the same issue with P2P on a multi RTX 4090 setup with an AMD WRX80.


Hi all,

Apologies for the delay. Feedback from Engineering is that peer-to-peer is not supported on the 4090. The applications/driver should not report this configuration as peer-to-peer capable. The reporting is being fixed, and future drivers will report the following instead:

I. # ./simpleP2P
[./simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access…
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) → NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) → NVIDIA GeForce RTX 4090 (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

II. ./streamOrderedAllocationIPC
Device 1 is not peer capable with some other selected peers, skipping
Step 0 done
Process 0: verifying…
Process 0 complete!
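
For application code, the practical implication is to gate direct peer access on the capability query, as the updated samples do, and to fall back to explicit copies where needed; cudaMemcpyPeer works even without P2P support, with the driver staging the transfer through host memory. An illustrative sketch of that pattern (not from the samples; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 << 20;   // 64 MB
    float *src0, *dst1;
    cudaSetDevice(0);
    cudaMalloc(&src0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst1, bytes);

    int can = 0;
    cudaDeviceCanAccessPeer(&can, 1, 0);   // can device 1 access device 0 directly?
    if (can) {
        // P2P available: kernels on device 1 may dereference src0 directly.
        cudaDeviceEnablePeerAccess(0, 0);  // current device is 1
    }

    // Works either way: direct copy when P2P is available, otherwise the
    // driver stages the transfer through host memory.
    cudaMemcpyPeer(dst1, 1, src0, 0, bytes);
    cudaDeviceSynchronize();

    printf("peer capable: %d\n", can);
    return 0;
}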

Thank you

What about other cards? The 4080, 4070, 4060? The new-gen A6000 and all the others that are coming?

@abchauhan Did the Engineering team provide any reasoning as to why P2P is not supported on the 4090?

Are there any plans to add support for P2P in the future?


The 3090 also does not support P2P, and that is not a problem. If the 4090 simply reported the same simpleP2P result as the 3090 does, I would think that is no problem either.

cuda-samples/Samples/0_Introduction/simpleP2P at master · NVIDIA/cuda-samples · GitHub
This is just one of the typical test cases.

The key problem is that using PyTorch with DataParallel or DistributedDataParallel causes the program to freeze and can even crash the server.

For example, example 2 here will hang: Multi-GPU Computing with Pytorch (Draft) [2. DataParallel: MNIST on multiple GPUs]
Or take GitHub - pytorch/benchmark: TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance. and run:

python test.py -k "test_BERT_pytorch_train_cuda"

The above command will crash the server. Yet it runs normally in NGC on the Intel Xeon platform, without crashing the server.

4 Likes