Hello, We found that some standard nVidia tests fail with dual nVidia RTX 4090 system. It is CUDA 11.8, driver 520.61.05 running on Linux dl 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux TEST-1: executables_v2/bin/x86_64/linux/release/simpleP2P [e…

Hi @vasilii.shelkov , Can you please check if you see the same failures with iommu disabled? On Ubuntu, please edit /etc/default/grub, append amd_iommu=off to the options in the line starting with GRUB_CMDLINE_LINUX=.... Save. Run #update-grub2 and reboot. Thank you

Did it work and scale with amd_iommu turned off?

With amd_iommu off we get the same errors with “nan” is being replaced by 0s: bizon@dl:~$ ~/kuklin/cuda-samples-12/cuda-samples/bin/x86_64/linux/release/simpleP2P [/home/bizon/kuklin/cuda-samples-12/cuda-samples/bin/x86_64/linux/release/simpleP2P] - Starting… Checking for multiple GPUs… CUDA-cap…

Here is log nvidia-bug-report.log.gz - Google Drive

I noticed that your system is from Bizon. Aren’t they able to help since they configured your system? Rather, aren’t they testing system for parallel training/gpu-computing before shipping it?

It looks like nVidia bug and unlikely that Bizon or their training could help here. This is the ticket number: “NVIDIA PSIRT” PSIRT@nvidia.com ; Bug report: standard nVidia P2P tests: 3902559

Hi vasilii.shelkov Thank you for the additional information. We can reproduce this issue on our systems. This is under investigation.

Hello @abchauhan – How do I receive an update for this issue? I tried sending email to PSIRT@nvidia.com about issue 3902559, but haven’t hear back yet. Is there a portal I can sign in to see the trace of updates? Thanks

@abchauhan This is a very serious issue and has already been reproduced by a number of people: Problems With RTX4090 MultiGPU and AMD vs Intel vs RTX6000Ada or RTX3090 | Puget Systems It seems new A6000 Ada cards are also affected on AMD CPUs. I have ordered 8 such cards, so I hope this will be fix…

Standard nVidia CUDA tests fail with dual RTX 4090 Linux box

Graphics / Linux Linux

abchauhan January 24, 2023, 8:51pm 2

Hi Vasilii,

Apologies for the delay. Can you please capture a Nvidia bug report from your system.
Please run nvidia-bug-report.sh as root or sudo user and attach the generated nvidia-bug-report.log.gz.
Can you also please provide the make/model of the motherboard and the system.

Thank you

Topic		Replies	Views
TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck - EXTRA GPU TAO Toolkit	13	1300	November 7, 2023
Parallel training with 4 cards 4090 cannot be performed on AMD 5975WX， stuck at the beginning CUDA Programming and Performance	13	6330	February 6, 2023
我用的CPU是AMD 5975WX，显卡是4块4090。cuda版本为cuda12，pytoch版本为2.0 Linux chinese	6	2146	December 29, 2022
One GPU NOT capable of Peer-to-Peer (P2P) CUDA Programming and Performance	22	5684	November 27, 2018
Low P2P GPU bandwidth performance between GeForce GPUs CUDA Programming and Performance	20	2183	October 9, 2024
P2P not working for P600s? CUDA Programming and Performance	7	1960	April 5, 2018
P2P Communication Fails 1080ti->1080ti. IOMMU & ACS disabled Linux	1	1797	February 15, 2020
multi-GPU Peer to Peer access CUDA SDK example not working, why? CUDA Programming and Performance	13	5386	February 26, 2015
Peer access not supported between devices CUDA Programming and Performance	11	7646	November 9, 2017
Partial fail of peer access in 8 Volta GPU instance (p3.16xlarge) on AWS -> huge slowdown CUDA Programming and Performance	32	4123	March 10, 2018

Standard nVidia CUDA tests fail with dual RTX 4090 Linux box

Related topics