RTX 2080 Ti driver crashing randomly, with or without load: ERR! 39C PERR! / 250W

Hello! I have been trying to solve an issue with an RTX 2080 Ti on Ubuntu 18.04, with Secure Boot disabled.

I get an error message in the fan and power fields of nvidia-smi, and sometimes the machine hangs during shutdown when rebooting (it is stuck in shutdown because some services hang, and I can no longer access it).

I have tried the .run file drivers 440, 450 and 460; currently I’m using the driver from apt install nvidia-driver-460.

I can only access this machine via SSH, so any change in the BIOS is difficult.
Sometimes the output from nvidia-smi is totally normal and then it suddenly crashes, showing ERR! 39C PERR! / 250W; other times it shows "No devices were found", and other times "the driver cant communicate with the device…"
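
Since the machine is only reachable over SSH, it can help to log which of these three states nvidia-smi is in over time. A minimal sketch, assuming the output text is captured as a string; the function name and the category labels are made up for illustration, and the matched phrases are only the ones quoted in this post plus the usual "couldn't communicate with the NVIDIA driver" wording:

```python
def classify_nvidia_smi(output: str) -> str:
    """Map raw nvidia-smi text to one of the failure modes described above.

    The labels returned here are hypothetical names, not NVIDIA terms.
    """
    text = output.lower()
    if "no devices were found" in text:
        return "no_devices"
    if "communicate with" in text:
        # Covers the "driver can't communicate" family of messages.
        return "driver_unreachable"
    if "err!" in text:
        # A sensor field (fan, power, ...) is reported as ERR!.
        return "sensor_error"
    return "ok"


if __name__ == "__main__":
    print(classify_nvidia_smi("ERR! 39C PERR! / 250W"))
```

Running this from a cron job or a loop and appending the result with a timestamp would show whether the card degrades from "ok" to "sensor_error" to "no_devices" in a consistent order.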

As I mentioned, any physical intervention, such as changing the card slot or reinstalling the whole system, is quite difficult as I can only access the machine via SSH.

Please find attached the bug report; hopefully it is not a hardware issue.

Thanks!

nvidia-bug-report.log (1.7 MB)

You’re always getting this shortly after boot:

Jan 28 15:53:30 aerovision-Z390-AORUS-ELITE kernel: NVRM: Xid (PCI:0000:01:00): 61, pid=2743, 0d20(31f0) 0500a3db 2f1327a6
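
To count how often this shows up across boots, the Xid line can be picked apart with a regular expression. This is only a sketch based on the single log line above; the pattern and the `parse_xid` name are assumptions, and other Xid messages may be formatted differently:

```python
import re

# Pattern derived from the one log line quoted above; the pid part is
# treated as optional in case other messages omit it.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
    r"(?:, pid=(?P<pid>\d+))?"
)


def parse_xid(line):
    """Return bus id, Xid number and pid from a kernel log line, or None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    pid = m.group("pid")
    return {
        "bus": m.group("bus"),
        "xid": int(m.group("xid")),
        "pid": int(pid) if pid is not None else None,
    }


if __name__ == "__main__":
    sample = ("Jan 28 15:53:30 aerovision-Z390-AORUS-ELITE kernel: NVRM: Xid "
              "(PCI:0000:01:00): 61, pid=2743, 0d20(31f0) 0500a3db 2f1327a6")
    print(parse_xid(sample))
```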

There’s a known problem
https://forums.developer.nvidia.com/t/random-xid-61-and-xorg-lock-up/79731/338
(try setting clocks

sudo nvidia-smi -lgc 1000,2145

)
but since this always happens right after boot, this might also be a general hardware problem (defective GPU).
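
If the clock lock is going to be scripted over SSH, the command above can be built programmatically so the range is checked first. A small sketch; the helper name is hypothetical, `-lgc` is the flag from the command above and `-rgc` is the matching nvidia-smi reset flag. As far as I know the lock is not kept across a reboot, so it would have to be applied again after each boot:

```python
def lock_gpu_clocks_cmd(min_mhz: int, max_mhz: int) -> list:
    """Build the argv for locking GPU clocks to the range [min_mhz, max_mhz]."""
    if not 0 < min_mhz <= max_mhz:
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    return ["sudo", "nvidia-smi", "-lgc", f"{min_mhz},{max_mhz}"]


# Undo the lock and return to default clock behaviour.
RESET_GPU_CLOCKS_CMD = ["sudo", "nvidia-smi", "-rgc"]


if __name__ == "__main__":
    # Prints the same command that was suggested above.
    print(" ".join(lock_gpu_clocks_cmd(1000, 2145)))
```

The list can be passed to `subprocess.run(...)` on the affected machine; it is only printed here so the sketch runs without a GPU.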

Thanks, I will try it out!

Indeed, there was a previous RTX 2080 Ti and it was defective (probably because the server suffered repeated power outages). I went through warranty and got a new one, but the person who installed it used the same slot as the old one. Could the slot on the motherboard have been damaged as well?

I will try that command anyway.

Best Regards

I’d say there’s only a very low chance of the slot being involved. That would also show different errors and failures.

I ran the command and rebooted, and nvidia-smi gave the same output; after a second reboot the nvidia-smi output is now “No devices were found”.

Here is the latest bug report log:

nvidia-bug-report.log (1.2 MB)

The GPU is broken; it has now given up completely:

[   15.333042] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1290)
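
For anyone comparing bug reports, this kind of line can be pulled out of dmesg output automatically. A minimal sketch, assuming the message always has the shape shown above; the function name is made up, and the code triple is returned as an opaque string because its meaning is not documented in this thread:

```python
import re

# Pattern derived from the single dmesg line quoted above.
INIT_FAIL_RE = re.compile(
    r"NVRM: GPU (?P<bus>[0-9a-fA-F:.]+): RmInitAdapter failed! \((?P<code>[^)]+)\)"
)


def find_init_failures(dmesg_text):
    """Return (bus_id, code) pairs for every RmInitAdapter failure found."""
    return [(m.group("bus"), m.group("code"))
            for m in INIT_FAIL_RE.finditer(dmesg_text)]


if __name__ == "__main__":
    sample = "[   15.333042] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1290)"
    print(find_init_failures(sample))
```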

You need to read the manuals.

http://us.download.nvidia.com/XFree86/Linux-x86_64/460.39/README/index.html

There’s so much wrong with how you’ve configured the system. I haven’t got time to get into it.
You are nowhere near a working setup.

I’ve read this quickly.

You’re missing NVIDIA libraries?

And only the NVIDIA audio driver is loading on your motherboard.

AMD is on and fighting with the Nouveau, framebuffer and VESA drivers.

Turn Secure Boot off forever.
Turn CSM off forever
(as per the latest NVIDIA instructions).

Turn AMD GPU driver off in BIOS.

Turn AMD Graphics Slice off in BIOS.

So you’ve got no blacklisting or modesetting.
So blacklist everything and modeset everything.
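
Before blacklisting anything, it may be worth checking which GPU kernel modules are actually loaded. A rough sketch that takes the text of /proc/modules (module name is the first field on each line); the set of names to look for and both function names are assumptions chosen for illustration:

```python
# GPU-related kernel module names to look for (an illustrative selection).
GPU_MODULES = {
    "nvidia", "nvidia_drm", "nvidia_modeset", "nvidia_uvm",
    "nouveau", "amdgpu", "radeon",
}


def loaded_gpu_modules(proc_modules_text):
    """Return the GPU-related module names present in /proc/modules text."""
    names = {line.split()[0]
             for line in proc_modules_text.splitlines() if line.strip()}
    return names & GPU_MODULES


def nouveau_conflicts_with_nvidia(loaded):
    """True if nouveau is loaded alongside any nvidia* module."""
    return "nouveau" in loaded and any(m.startswith("nvidia") for m in loaded)


if __name__ == "__main__":
    # Hypothetical excerpt; on a real machine, read /proc/modules instead.
    sample = ("nvidia_drm 57344 4 - Live 0x0000000000000000\n"
              "nvidia 34000000 200 nvidia_drm, Live 0x0000000000000000\n"
              "snd_hda_intel 53248 3 - Live 0x0000000000000000\n")
    mods = loaded_gpu_modules(sample)
    print(sorted(mods), nouveau_conflicts_with_nvidia(mods))
```

If nouveau does not appear in the result, blacklisting it would not be expected to change anything.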

The CUDA driver is installed and running. Why?
Uninstall it for now.

Your DMA buffers aren’t set to 64-bit in BIOS.
Set Above 4G Decoding in BIOS to 64-bit: ENABLED.

Disable IOAPIC 24-119 Entries in BIOS

Legacy USB Support disabled

XHCI Hand-off disabled

PCIe Bifurcation Support: PCIe x16. This has to be set to everything on the PCIe 1 slot.

IGFX: onboard graphics, off
PCIe 1 Slot: sets the graphics card on the PCIEX16 slot as the first display. (Default)
PCIe 2 Slot: not used

EZ RAID off. I think you’ve got the PCIe slot running as a RAID.

VGA Support? I’m not sure about this. I’d leave it on legacy first until you get up and running.

PS2 Devices Support?? Not sure; I would disable this. This is connected to EHCI / IOMMU / xHCI.
There have been changes in the kernel with xinput…libev…so on. Disable this.

Boot Option Priorities uefi / secureboot /csm off.

Profile DDR Voltage
When using a non-XMP memory module or Extreme Memory Profile (X.M.P.) is set to Disabled

AVX Offset: disable if you can

Graphics Slice Ratio off. These are for AMD cards only, and Radeon at that.

Graphics UnSlice Ratio off

Intel® Turbo Boost Technology turn off for now

Max Link Speed set to auto or gen 3

Internal Graphics set to DISABLED

Set both DVMT options to minimum even though Internal Graphics is disabled

RC6 (Render Standby) SET TO DISABLED ANYWAY

Platform Power Management [ DISABLE ]

I’d bet there’s a WRONG config issue with M.2 storage RAID support as well.

I would do a clean install once you’ve fixed the settings.
Secure Boot off. lm-sensors, fw-update, PCI-ID utils need to be run.

PAT CPU Support on.
Blacklist and modeset everything.

Chapter 38. Addressing Capabilities? NVIDIA manual.
Why’s your card reporting 47 bits?

Driver fails to initialize when MSI interrupts are enabled,
Not the issue but be aware of it.

Chapter 19. Configuring Flipping and UBB: maybe enable

outta time.
Good luck

After changing to Ubuntu 20.04 and fixing some BIOS issues the problem persisted. I finally went through warranty and the card was faulty; I got a new 3070 and it is working perfectly.

Thanks for the support.