Standard NVIDIA CUDA tests fail on a dual RTX 4090 Linux box

Hello,

We found that some standard NVIDIA CUDA tests fail on a dual NVIDIA RTX 4090 system.

The setup is CUDA 11.8 with driver 520.61.05, running on Linux dl 5.15.0-52-generic #58-Ubuntu
SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

TEST-1:
executables_v2/bin/x86_64/linux/release/simpleP2P
[executables_v2/bin/x86_64/linux/release/simpleP2P] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access…

Peer access from NVIDIA GeForce RTX 4090 (GPU0) → NVIDIA GeForce RTX 4090 (GPU1) : Yes
Peer access from NVIDIA GeForce RTX 4090 (GPU1) → NVIDIA GeForce RTX 4090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1…

Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 25.19GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access…
Shutting down…
Test failed!
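
For readers unfamiliar with this sample, the check that produces the lines above can be sketched roughly as follows. This is a minimal illustration in Python, not the sample's actual source. It assumes, as the ref values in the log suggest (0, 4, 8, ...), that the host buffer starts out as the element index (wrapped at 4096, which is an assumption) and that each of the two kernel passes doubles it. The name count_mismatches and the max_report parameter are hypothetical.

```python
def count_mismatches(result, max_report=12):
    """Compare a copied-back buffer against the expected values.

    Assumption: element i starts as float(i % 4096) and is doubled once
    on each GPU, so the expected value is 4 * (i % 4096).
    """
    errors = 0
    for i, val in enumerate(result):
        ref = float(i % 4096) * 2.0 * 2.0
        # NaN compares unequal to everything, so NaN results are always flagged.
        if val != ref:
            print(f"Verification error @ element {i}: val = {val}, ref = {ref:f}")
            errors += 1
            if errors >= max_report:
                break  # stop after a handful of reports, as the log does
    return errors


if __name__ == "__main__":
    # A buffer full of NaN, similar to what GPU0 returned in TEST-1.
    count_mismatches([float("nan")] * 32)
```

Under these assumptions, a buffer of NaN values means that no element came back from the two peer-to-peer passes with its expected value.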

TEST-2:
executables_v2/bin/x86_64/linux/release/OrderedAllocationIPC
Step 0 done
Step 1 done
Process 0: verifying…
Process 0: Verification mismatch at 0: 0 != 1
Process 0: Verification mismatch at 1: 0 != 1
Process 0: Verification mismatch at 2: 0 != 1
Process 0: Verification mismatch at 3: 0 != 1
Process 0: Verification mismatch at 4: 0 != 1
Process 0: Verification mismatch at 5: 0 != 1
Process 0: Verification mismatch at 6: 0 != 1
.
.
.

Here is the system info:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
CPU family: 25
Model: 8
Thread(s) per core: 1
Core(s) per socket: 32

Hi Vasilii,

Apologies for the delay. Could you please capture an NVIDIA bug report from your system?
Please run nvidia-bug-report.sh as root (or with sudo) and attach the generated nvidia-bug-report.log.gz.
Could you also please provide the make and model of the motherboard and the system?

Thank you

Hi @vasilii.shelkov,

Can you please check whether you see the same failures with the IOMMU disabled?

On Ubuntu, please edit /etc/default/grub and append amd_iommu=off to the options on the line starting with GRUB_CMDLINE_LINUX=. Save the file, run update-grub2 as root, and reboot.
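
If it helps, the edit to /etc/default/grub can be sketched as a small Python helper. This is only an illustration, assuming the GRUB_CMDLINE_LINUX= value is wrapped in double quotes, as in the stock Ubuntu file. The function name append_kernel_option is hypothetical, and the sketch only transforms text; it does not write the file or run update-grub2.

```python
import re


def append_kernel_option(grub_text: str, option: str = "amd_iommu=off") -> str:
    """Return grub_text with `option` appended to the GRUB_CMDLINE_LINUX= line.

    Assumption: the value is double-quoted. GRUB_CMDLINE_LINUX_DEFAULT= is
    left untouched, and the option is not added twice.
    """
    out = []
    for line in grub_text.splitlines():
        m = re.match(r'^(GRUB_CMDLINE_LINUX=)"(.*)"\s*$', line)
        if m and option not in m.group(2).split():
            opts = (m.group(2) + " " + option).strip()
            line = f'{m.group(1)}"{opts}"'
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
        'GRUB_CMDLINE_LINUX=""\n'
    )
    print(append_kernel_option(sample), end="")
```

Writing the result back to /etc/default/grub needs root, and the change only takes effect after update-grub2 and a reboot. Afterwards, /proc/cmdline should contain amd_iommu=off.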

Thank you

Did it work and scale with amd_iommu turned off?