Issues with 3090

Hello mates, hoping you and your loved ones are OK. I need assistance ASAP. I have a problem: I have a 3090 paired with a Xeon 1235 v5 (I can't upgrade, since this is not my PC; it belongs to my lab). The motherboard is an ASUS P10S-E/4L with 32 GB of Corsair Dominator DDR4 RAM, an NVMe drive, an SSD, a 1 TB Western Digital hard drive, and a Thermaltake Water 3.0 AIO. It runs Linux (Ubuntu 20.04) with the latest NVIDIA driver, CUDA, and cuDNN.

The issue is the following: this PC is used for research (deep learning). With the previous card, a 1080 Ti Mini, the system was fine; I could let it run for months without a problem. But since swapping to the 3090 it constantly throws XID 79, 13, and 48 errors. Any thoughts on what it could be? I forgot to mention my PSU is a Corsair RM850.

Please run as root and attach the resulting nvidia-bug-report.log.gz file to your post.
Most common causes for XID 79 are lack of power (possible power spikes) and overheating.
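For reference, the XID codes you listed map to documented meanings in NVIDIA's XID table (13 = Graphics Engine Exception, 48 = Double Bit ECC Error, 79 = GPU has fallen off the bus). A small sketch that annotates them straight from the kernel log; the sample line and pid below are made up for illustration:

```shell
#!/bin/sh
# Triage helper: map the XID codes seen in this thread to their documented
# meanings from NVIDIA's XID error catalog. A sketch, not official tooling.

xid_meaning() {
  case "$1" in
    13) echo "Graphics Engine Exception" ;;
    48) echo "Double Bit ECC Error" ;;
    79) echo "GPU has fallen off the bus" ;;
    *)  echo "unknown to this helper; check the NVIDIA XID table" ;;
  esac
}

# Example on a sample kernel log line (on the affected box you would pipe
# `dmesg | grep Xid` through the same extraction):
sample='NVRM: Xid (PCI:0000:01:00): 79, pid=1337, GPU has fallen off the bus.'
code=$(printf '%s\n' "$sample" | sed -n 's/.*Xid ([^)]*): \([0-9]*\),.*/\1/p')
echo "Xid $code -> $(xid_meaning "$code")"
```

XID 79 in particular means the driver lost contact with the card entirely, which is why power delivery and overheating are the first suspects.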

Thanks for your reply @generix. You can find the result here:
nvidia-bug-report.log.gz (237.4 KB)

In addition, I use the TF, PyTorch, and PyCUDA development frameworks. In another forum I saw GPUMemTest (Windows) being used; it reports an OK result.

Also, I attach a pic of the system.

There are also some AER messages, but I don't know if they are part of the cause or just a follow-up of the GPU shutting down. Please check whether this is a power issue by limiting clocks, e.g.
nvidia-smi -lgc 300,1500
A system BIOS update is also worth a try, if available.

A BIOS update is not available at this time; any other suggestions? I will try the mentioned command and let you know.

Could those errors be caused by the CSM and Fast Boot options in the BIOS? Later, I will post the results with the clocks set to the proposed range and persistence mode disabled.

In my humble opinion, as dirty as this looks in the picture, you should really clean the whole computer! AFAIK that can also have an impact on function.

Yes, I know; I cleaned it out just now.

AER messages (PCIe bus errors) shouldn't be caused by CSM or Fast Boot.

After limiting the clocks the result is the same. Here is a walkthrough of what I did:
1. Cleaned the whole PC
2. Reflashed the stock BIOS and restored the default settings (CSM disabled)
3. Tested on Windows (logs from FurMark, Corsair iCUE, and GPUMemTest attached below)
4. Tested again on Ubuntu
@P10S-E-Series:~$ nvidia-smi -lgc 300,1500
GPU clocks set to “(gpuClkMin 300, gpuClkMax 1500)” for GPU 00000000:01:00.0

Warning: persistence mode is disabled on device 00000000:01:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.
@P10S-E-Series:~$ nvidia-smi
Thu Jan  6 14:01:32 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   44C    P2   217W / 350W | 23887MiB / 24259MiB  |     62%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     31045      C   /bin/python3                    23885MiB |
+-----------------------------------------------------------------------------+
@P10S-E-Series:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error
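(A side note on the persistence-mode warning in the output above: the state can be checked by parsing `nvidia-smi -q` and enabled with `sudo nvidia-smi -pm 1` or the nvidia-persistenced service. A minimal sketch, run here against a captured sample line so it works anywhere; the indentation mimics the usual `nvidia-smi -q` layout:)

```shell
#!/bin/sh
# Check persistence mode by parsing `nvidia-smi -q` text on stdin.
# Sketch only; on the real machine: nvidia-smi -q | persistence_state

persistence_state() {
  sed -n 's/^ *Persistence Mode *: *//p' | head -n 1
}

# Captured sample line standing in for live `nvidia-smi -q` output:
sample='    Persistence Mode                      : Disabled'
state=$(printf '%s\n' "$sample" | persistence_state)
echo "Persistence mode: $state"
```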

Here you can find the new nvidia-bug-report.log.gz
corsair_cue_20220106_13_09_00.csv (10.2 KB)
FurMark_0001.txt (2.9 KB)

gpumemtest_devtest3090.txt (786 Bytes)
gpumemtest_exectest3090.txt (4.6 KB)

nvidia-bug-report.log.gz (175.8 KB)

This time it's clearer: it's the mainboard.

[ 4159.497434] pcieport 0000:00:01.0: AER: Multiple Corrected error received: 0000:00:01.0
[ 4159.547208] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 4159.547213] pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00000001/00002000
[ 4159.547215] pcieport 0000:00:01.0:    [ 0] RxErr                 
[ 4159.547221] pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
[ 4159.547225] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[ 4159.547241] fbcon: Taking over console
[ 4159.547246] pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
[ 4159.547250] pcieport 0000:00:01.0:    [14] CmpltTO                (First)
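A quick way to split the harmless corrected errors from the fatal one is to count the severities in the log. A sketch using the lines above as sample input; on the machine, feed it `dmesg` instead of the embedded text:

```shell
#!/bin/sh
# Count PCIe AER errors by severity. Corrected errors are recovered by the
# link; Uncorrected (Fatal) ones like the CmpltTO above take the device down.

count_aer() {  # $1 = severity substring to count
  grep -c "severity=$1"
}

# Sample lines taken from the kernel log above:
log='pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)'

corrected=$(printf '%s\n' "$log" | count_aer 'Corrected,')
fatal=$(printf '%s\n' "$log" | count_aer 'Uncorrected')
echo "corrected=$corrected fatal=$fatal"
```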

The root port the 3090 is connected to breaks down, so the GPU is lost afterwards.

In this case, should I send the card in for warranty service? Or is there any software or hardware solution? Or maybe I misunderstood and it's my mobo?

The 3090 is probably fine; the mobo just can't handle it.
Might also be caused by bad grounding/shielding. Looking at the picture, at least no risers are involved.

Then I need to talk with my lab head about it. One last question: can I plug it in using the following scheme, just to diagnose the 3090?

GPU —> riser —> PCIe x8?

I ask this because with the 1080 Ti this issue was not present, but my model (deep learning) is too large; that's why we swapped to a 3090…

The 3090 is faster, has tensor cores and so on, so it stresses the bus more.
I don't understand your question; are you using a riser after all? If so, that is most likely the cause.

No, it's plugged directly into the slot. I have two riser cables from Thermaltake, one with capacitors and one without. My question is: can I plug the card into a riser cable and test it in the x16 slot and in the x8 slot, just to confirm that the issue is with the board (mobo)?

You can try, but risers often introduce even more bus errors.
Maybe check if you can lower the PCIe speed to Gen2 in the BIOS for a simpler test.
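If you do lower it, you can verify the link actually trained at Gen2 by checking the LnkSta line from `lspci -vv` (2.5 GT/s = Gen1, 5 GT/s = Gen2, 8 GT/s = Gen3); `nvidia-smi --query-gpu=pcie.link.gen.current --format=csv` reports the same thing. A sketch parsing a sample LnkSta line:

```shell
#!/bin/sh
# Extract the negotiated PCIe link speed from `lspci -vv` output.
# On the machine: sudo lspci -vv -s 01:00.0 | link_speed

link_speed() {
  sed -n 's/.*LnkSta:[[:space:]]*Speed \([0-9.]*GT\/s\).*/\1/p'
}

# Sample LnkSta line standing in for live lspci output (Gen2 link):
sample='    LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-'
echo "current link speed: $(printf '%s\n' "$sample" | link_speed)"
```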

I actually have what appears to be the exact same model as you - from the device ID you got in GPU-Z and the photo, it looks like an EVGA XC3 Ultra RTX 3090 - that's exactly what I have. I actually got it in person on launch day at 9 AM at Micro Center, so I've been running the card on Linux longer than any other consumer out there. I've had to RMA the card twice, but due to completely unrelated issues (the first RMA was actually caused by my motherboard, which luckily EVGA's warranty covered; the second was that the third fan went bad).

I’ve never had these kinds of crashes that you’re experiencing, and while I’ve never run any sort of NN workloads, I’ve put it under full gaming workloads using the tensor cores and RT cores along with the CUDA cores, all at the same time.

As generix said, this absolutely looks like a motherboard issue. It's not at all hard to imagine a 4-5 year old 1080 Ti not causing issues that the 3090 actually triggers. The difference in GPU power is insane; I believe in plain rasterization the 3090 has more than double the GPU power of a 1080 Ti (plus gen 3 tensor cores and gen 2 RT cores, neither of which the 1080 Ti has).

If you would give me some workloads that I could easily run (I already have cuda installed) that trigger the issue for you, I could run them and see what results I have.

@P10S-E-Series:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

This is also weird.

But yeah, if there are some benchmarks, workloads, or torture tests that trigger issues for you let me know what they are and I can try them on my system. I have tensorflow-cuda, cuda, and cudnn already installed.
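If you don't have an easy-to-share training script, one generic stress workload we could both run is gpu-burn (assuming the usual wilicc/gpu-burn project on GitHub; clone it and `make` with CUDA installed). A hedged sketch with a dry-run fallback for when the binary isn't built; the 600-second default duration is just my guess at a reasonable run:

```shell
#!/bin/sh
# Run gpu-burn for a fixed duration as a shared stress workload.
# Assumption: the gpu_burn binary was built in the current directory from
# the wilicc/gpu-burn repository. Falls back to a dry run otherwise.

DURATION="${GPU_BURN_SECS:-600}"   # seconds; override via GPU_BURN_SECS

run_burn() {
  cmd="./gpu_burn $DURATION"
  if [ -x ./gpu_burn ]; then
    $cmd
  else
    echo "dry-run: would run '$cmd' on the real machine"
  fi
}

run_burn
```

If the same run throws XID/AER errors on your board but not on mine, that would point even more firmly at the mainboard.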

Thanks for your kind reply @gardotd426. Yeah, I know there is a huge difference in power between the two cards. We swapped to the 3090 because our models are becoming too large and too heavy; our plan was to build a new workstation, but due to the current social situation and government policies our funding was cut… Our card is an EVGA GeForce RTX 3090 XC3 ULTRA GAMING, which I think, as you mentioned, is the same as yours.

I think I forgot to mention that the model (NN, DL) was working OK before a kernel update (I mean, it compiled and did its stuff). I keep testing now with a new kernel and the advice of @generix, and so far no crashes at all. I want to keep this thread alive to see if anyone has the same issue. When I deploy a simple NN (an MLP, for example) or a DL model (transfer learning on a computer vision task), I have no issues with my dataset (150k images, around 100 GB).

Also, I didn't mention that I disabled Intel SpeedStep and Speed Shift, and removed a function that computes the difference between the Siamese network outputs (Euclidean distance). I think this was also part of the problem, because it does the computation on the CPU and not on the GPU.