Issues with 3090

londere87 · January 5, 2022, 2:18am

Hello mates! Hoping you and your loved ones are OK. I need assistance ASAP. I have a problem with a 3090 paired with a Xeon V5 1235 (I can’t upgrade since this is not my PC, it belongs to my lab). The motherboard is a P10S-E/4L with 32 GB of Corsair Dominator DDR4 RAM, an NVMe drive, an SSD, a 1 TB Western Digital drive, and a TT Water3.0 AIO cooler. It runs Linux (Ubuntu 20.04) with the latest NVIDIA driver, CUDA and cuDNN.

The issue is as follows: this PC is used for research (deep learning). Before the change, with a 1080 Ti mini, the system was OK and could run for months without a problem. Since I swapped in the 3090, it constantly gives XID 79, 13 and 48. Any thoughts on what it could be? I forgot to mention that my PSU is a Corsair RM850.

generix · January 5, 2022, 11:03am

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
Most common causes for XID 79 are lack of power (possible power spikes) and overheating.
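
To see which Xid codes actually show up and how often, the kernel log can be scanned for the driver's `NVRM: Xid` lines. The following is a minimal sketch, not an official tool: the function name `find_xids` and the sample log line are made up for illustration, and the short descriptions are the ones listed in NVIDIA's Xid documentation for the three codes reported in this thread.

```python
import re

# Short descriptions for the Xid codes reported in this thread
# (as listed in NVIDIA's Xid documentation).
XID_MEANINGS = {
    13: "Graphics Engine Exception",
    48: "Double Bit ECC Error",
    79: "GPU has fallen off the bus",
}

# Matches e.g. "NVRM: Xid (PCI:0000:01:00): 79, ..."
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]*)\): (?P<code>\d+)")


def find_xids(log_text):
    """Return (device, code, meaning) for every Xid line in kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            code = int(match.group("code"))
            meaning = XID_MEANINGS.get(code, "see the Xid documentation")
            events.append((match.group("dev"), code, meaning))
    return events


# Hypothetical sample line for illustration, not taken from the attached report.
sample = "[ 4159.6] NVRM: Xid (PCI:0000:01:00): 79, pid=31045, GPU has fallen off the bus."
for dev, code, meaning in find_xids(sample):
    print(f"{dev}: Xid {code} ({meaning})")
```

In practice the text passed to `find_xids` would be the output of `dmesg` or `journalctl -k`, or the kernel log section of the bug report.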

londere87 · January 5, 2022, 5:37pm

Thanks for your reply, @generix. You can find the result here:
nvidia-bug-report.log.gz (237.4 KB)

In addition, I use the TF, PyTorch and PyCUDA frameworks. In another forum I saw GPUMemTest (Windows) recommended, and it reports an OK result.

Also, I attach a pic of the system.

generix · January 5, 2022, 10:18pm

There are also some AER messages, but I don’t know if they are part of the cause or just a consequence of the GPU shutting down. Please check whether this is a power issue by limiting clocks, e.g.
nvidia-smi -lgc 300,1500
A system BIOS update is also worth a try if available.
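
To confirm that the clock limit is in effect while the workload runs, power draw, graphics clock and temperature can be polled through the `nvidia-smi --query-gpu` interface. This is only a sketch under assumptions: the helper names (`parse_sample`, `read_gpu`, `exceeds_cap`) are invented here, and it assumes the GPU reports numeric values for all three fields (some cards print `[N/A]` for power, which this code does not handle).

```python
import subprocess

# Fields understood by `nvidia-smi --query-gpu`.
QUERY = "power.draw,clocks.gr,temperature.gpu"


def parse_sample(csv_line):
    """Parse one 'power, clock, temp' CSV line into a dict of floats."""
    power, clock, temp = (float(field.strip()) for field in csv_line.split(","))
    return {"power_w": power, "clock_mhz": clock, "temp_c": temp}


def read_gpu(index=0):
    """Query one GPU once. Needs the NVIDIA driver and nvidia-smi on PATH."""
    result = subprocess.run(
        ["nvidia-smi", f"--id={index}", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_sample(result.stdout.strip().splitlines()[0])


def exceeds_cap(sample, max_clock_mhz=1500):
    """True if the graphics clock is above the limit set with -lgc."""
    return sample["clock_mhz"] > max_clock_mhz


# Illustrative values only; on a real system call read_gpu() in a loop.
print(parse_sample("217.43, 1500, 44"))
```

If the crash still happens with the clock never exceeding the cap, power spikes from boosting become a less likely explanation. The limit can be removed again with `nvidia-smi -rgc`.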

londere87 · January 6, 2022, 1:12am

A BIOS update is not available at this time. Any other suggestions? I will try the command you mentioned and let you know.

londere87 · January 6, 2022, 8:32pm

Could those errors be caused by the CSM and Fastboot options of the BIOS? Later I will post the results with the clocks set to the proposed frequency and persistence mode disabled.

Mart · January 6, 2022, 8:47pm

In my humble opinion, as dirty as this looks in the picture, you should really clean the whole computer! AFAIK dirt can also have an impact on function.

londere87 · January 6, 2022, 9:03pm

Yes, I know. I cleaned it out just now.

generix · January 6, 2022, 9:24pm

AER messages (PCIe bus errors) shouldn’t be caused by CSM or fastboot.

londere87 · January 6, 2022, 9:35pm

After limiting the clocks the result is the same. Here is a walkthrough of what I did:
1.- Cleaned the whole PC
2.- Reflashed the stock BIOS and restored the default settings (CSM disabled)
3.- Tested on Windows (logs from FurMark, Corsair iCUE and GPUMemTest attached)
4.- Tested again on Ubuntu
@P10S-E-Series:~$ nvidia-smi -lgc 300,1500
GPU clocks set to "(gpuClkMin 300, gpuClkMax 1500)" for GPU 00000000:01:00.0

Warning: persistence mode is disabled on device 00000000:01:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.
@P10S-E-Series:~$ nvidia-smi
Thu Jan 6 14:01:32 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   44C    P2   217W / 350W |  23887MiB / 24259MiB |     62%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     31045      C   /bin/python3                    23885MiB |
+-----------------------------------------------------------------------------+
@P10S-E-Series:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

Here you can find the new nvidia-bug-report.log.gz
corsair_cue_20220106_13_09_00.csv (10.2 KB)
FurMark_0001.txt (2.9 KB)

gpumemtest_devtest3090.txt (786 Bytes)
gpumemtest_exectest3090.txt (4.6 KB)

nvidia-bug-report.log.gz (175.8 KB)

generix · January 6, 2022, 9:49pm

This time it is clearer: it’s the mainboard.

[ 4159.497434] pcieport 0000:00:01.0: AER: Multiple Corrected error received: 0000:00:01.0
[ 4159.547208] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 4159.547213] pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00000001/00002000
[ 4159.547215] pcieport 0000:00:01.0:    [ 0] RxErr                 
[ 4159.547221] pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
[ 4159.547225] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[ 4159.547241] fbcon: Taking over console
[ 4159.547246] pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
[ 4159.547250] pcieport 0000:00:01.0:    [14] CmpltTO                (First)

The root port (00:01.0) the 3090 is connected to breaks down, so the GPU is lost afterwards.
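
For anyone who wants to check their own logs for the same pattern, here is a small sketch that pulls the `PCIe Bus Error` lines out of kernel log text and reports whether any of them is fatal. The two log lines are copied from the excerpt above; the function names `summarize_aer` and `has_fatal` are made up for this example.

```python
import re

# Matches the "PCIe Bus Error" summary line printed by the kernel's AER driver.
AER_RE = re.compile(
    r"pcieport (?P<port>[0-9a-f:.]+): PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)


def summarize_aer(log_text):
    """Return (port, severity, layer) for every AER bus error line."""
    events = []
    for line in log_text.splitlines():
        match = AER_RE.search(line)
        if match:
            events.append((
                match.group("port"),
                match.group("severity").strip(),
                match.group("layer").strip(),
            ))
    return events


def has_fatal(events):
    """True if at least one event is an uncorrected fatal error."""
    return any("Fatal" in severity for _, severity, _ in events)


log = """\
[ 4159.547208] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 4159.547225] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
"""

events = summarize_aer(log)
for port, severity, layer in events:
    print(f"{port}: {severity} in {layer}")
print("fatal error seen:", has_fatal(events))
```

Corrected physical-layer errors alone are recovered by the link; it is the fatal completion timeout on the same port that matches the GPU disappearing.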

londere87 · January 6, 2022, 9:53pm

In this case, should I send the card in for warranty? Or is there any solution via SW or HW? Or maybe I misunderstood and it’s my mobo?

generix · January 6, 2022, 9:58pm

The 3090 is probably fine, the mobo just can’t handle it.
It might also be caused by bad grounding/shielding. Looking at the picture, at least there are no risers involved.

londere87 · January 6, 2022, 10:04pm

Then I need to talk with my lab head about it. One final question: can I plug it in as follows, only to diagnose the 3090?

GPU --> Riser --> PCIe x8?

I ask because this issue was not present with the 1080 Ti, but my (deep learning) model is too large, which is why we swapped to a 3090…

generix · January 6, 2022, 10:10pm

The 3090 is faster, has tensor cores and so on, so it stresses the bus more.
I don’t understand your question. Are you using a riser after all? Then this is most likely the cause.

londere87 · January 6, 2022, 10:13pm

No, it’s plugged directly into the slot. I have two riser cables from Thermaltake, one with caps and the other without. My question is whether I can plug the card into a riser cable and test it in the x16 slot and in the x8 slot, only to confirm that the issue is with the board (mobo).

generix · January 6, 2022, 10:16pm

You can try, but risers often introduce even more bus errors.
Maybe check if you can lower the PCIe speed to gen2 in the BIOS for a simpler test.
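
After changing the BIOS setting, the negotiated link speed can be verified from Linux through the PCI sysfs attributes `current_link_speed`, `max_link_speed` and `current_link_width`. A minimal sketch, assuming the GPU sits at 0000:01:00.0 as in the nvidia-smi output earlier in the thread; the helper names `pcie_gen` and `read_link` are invented for illustration.

```python
from pathlib import Path

# Per-lane transfer rate in GT/s mapped to the PCIe generation.
GEN_BY_RATE = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4}


def pcie_gen(speed_text):
    """Map a sysfs speed string such as '8.0 GT/s PCIe' or '5 GT/s' to a generation.

    Returns None if the string cannot be interpreted (e.g. 'Unknown speed').
    """
    try:
        rate = float(speed_text.split()[0])
    except (ValueError, IndexError):
        return None
    return GEN_BY_RATE.get(rate)


def read_link(bdf="0000:01:00.0"):
    """Read the link attributes of one PCI device from sysfs (Linux only)."""
    dev = Path("/sys/bus/pci/devices") / bdf
    names = ("current_link_speed", "max_link_speed", "current_link_width")
    return {name: (dev / name).read_text().strip() for name in names}


# Illustrative strings; on the real machine use read_link()["current_link_speed"].
for text in ("8.0 GT/s PCIe", "5 GT/s"):
    print(text, "-> gen", pcie_gen(text))
```

Note that the GPU may drop the link to a lower speed when idle, so the current speed is best read while the workload is running.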

gardotd426 · January 10, 2022, 12:27pm

I actually have what appears to be the exact same model as you - from the device ID you got in GPU-Z and the photo it looks like an EVGA XC3 Ultra RTX 3090 - that’s exactly what I have, I actually got it in-person on launch day at 9AM at Micro Center, so I’ve been running the card on Linux longer than any other consumer out there. I’ve had to RMA the card twice, but due to completely unrelated issues (the first RMA was caused actually by my motherboard, luckily EVGA’s warranty covered it, the second was that the third fan went bad).

I’ve never had these kinds of crashes that you’re experiencing, and while I’ve never run any sort of NN workloads, I’ve put it under full gaming workloads using the tensor cores and RT cores along with the CUDA cores, all at the same time.

As generix said, this absolutely looks like a motherboard issue. It’s not at all hard to imagine a 4-5 year old 1080 Ti not causing issues that the 3090 actually triggers. The amount of difference in GPU power is insane, I believe in just plain rasterization the 3090 is more than double the GPU power of a 1080 Ti (plus the gen 3 tensor cores and gen 2 RT cores, which the 1080 Ti has neither of).

If you would give me some workloads that I could easily run (I already have cuda installed) that trigger the issue for you, I could run them and see what results I have.

@P10S-E-Series:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

This is also weird.

But yeah, if there are some benchmarks, workloads, or torture tests that trigger issues for you let me know what they are and I can try them on my system. I have tensorflow-cuda, cuda, and cudnn already installed.

londere87 · January 10, 2022, 1:04pm

Thanks for your kind reply, @gardotd426. Yes, I know about the huge difference in power between the two cards. We swapped to the 3090 because our models are becoming too large and too heavy. Our plan was to build a new workstation, but due to the current social situation and government policies our funding was cut… Our card is an EVGA GeForce RTX 3090 XC3 ULTRA GAMING, which I think, as you mentioned, is the same as yours.

I think I forgot to mention that before a kernel update the model (NN, DL) was working OK (I mean it compiled and did its job). I keep testing now with a new kernel and the advice of @generix, and so far there have been no crashes at all. I want to keep this thread alive to see if anyone has the same issue. When I deploy a simple NN (an MLP, for example) or a DL model (transfer learning on a computer vision task), I have no issues with my dataset (150k images, around 100 GB).

Also, I didn’t mention that I disabled Intel SpeedStep and Speed Shift and removed a function that calculates the distance (Euclidean distance) between the Siamese networks. I think this was also the problem, because it did the computation on the CPU and not on the GPU.
