GPU failing after Ubuntu 20.04 install

I’m running a setup with two RTX 2080s, and recently I’ve noticed that one of them is not working properly: sometimes after booting it does not show up in the task manager, although most of the time it works fine. The GPU that fails is in the second slot of the motherboard, and if I move it to the first slot, neither GPU works.
Before this problem started, I had updated my OS to Ubuntu 20.04 as well as the NVIDIA drivers, which may be related.

Does anyone have any suggestions?

nvidia-bug-report.log.gz (707.0 KB)

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Ok, done.

[ 7146.204589] Out of memory: Killed process 13081 (python) total-vm:121117196kB, anon-rss:112920660kB, file-rss:0kB, shmem-rss:4kB, UID:1000 pgtables:223272kB oom_score_adj:0
[ 7148.376317] oom_reaper: reaped process 13081 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB
[40243.967391] input: BluetoothMouse3600 Mouse as /devices/virtual/misc/uhid/0005:045E:0916.0006/input/input37
[40243.967647] input: BluetoothMouse3600 Consumer Control as /devices/virtual/misc/uhid/0005:045E:0916.0006/input/input38
[40243.967741] hid-generic 0005:045E:0916.0006: input,hidraw2: BLUETOOTH HID v1.00 Mouse [BluetoothMouse3600] on 00:1a:7d:da:71:13
[41516.278956] NVRM: GPU at PCI:0000:09:00: GPU-4e7f9176-3ec2-1bd4-2049-884a136e8c82
[41516.278959] NVRM: GPU Board Serial Number:
[41516.278961] NVRM: Xid (PCI:0000:09:00): 43, pid=24693, Ch 00000018

It seems you ran into multiple out-of-memory conditions, and immediately afterwards an Xid 43 was thrown.
https://docs.nvidia.com/deploy/xid-errors/index.html
The doc above gives no more information than “GPU stopped processing”.
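
To find every Xid event in a long dmesg dump rather than scanning it by eye, a small Python sketch like the one below may help. It assumes the Xid lines have the same shape as the one quoted above, including a numeric pid; `XID_RE` and `find_xid_events` are names made up for this example and are not part of any NVIDIA tool.

```python
import re

# Matches lines shaped like the NVRM Xid line in the excerpt above.
# Assumption: the pid field is numeric; lines with any other pid format are skipped.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>\d+)"
)


def find_xid_events(log_text):
    """Return one dict per Xid line found in kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append({
                "timestamp": float(match.group("ts")),  # seconds since boot
                "pci": match.group("pci"),
                "xid": int(match.group("xid")),
                "pid": int(match.group("pid")),
            })
    return events


# The Xid line from the log excerpt in this thread.
sample = "[41516.278961] NVRM: Xid (PCI:0000:09:00): 43, pid=24693, Ch 00000018"
for event in find_xid_events(sample):
    print(event)
```

The text to scan could come from `dmesg` or from the kernel log section of the unpacked nvidia-bug-report.log.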

No other errors that I can see… generix might spot more.

You have an old xorg.conf lying around in /etc/X11, generated by the driver 430 installation. I don’t think you need it; you might try removing it.
Also, the BIOS is quite old (2018); you might check for an update.

Never seen an XID 43 before but according to
https://docs.nvidia.com/deploy/xid-errors/index.html#topic_5_4
it should be non-fatal (at least for the gpu/driver) which seems confirmed by the fact that nvidia-smi still reports it in a working state.
Since you’re saying “does not show up in the task manager”, I guess you’re running some application on it? Or what do you mean by that?

Hey, Mart.
I appreciate the response!
I’ve deleted the .conf file.
I don’t have a flash drive at home at the moment, so the BIOS update will have to wait a bit.

The docs mention that XID 43 is caused by the user application. That brings to mind a project I’ve been working on these past months that had a memory leak. The script I ran spawned many workers to do some task, and some of them used the GPU. The memory leak would eventually cause the PC to freeze momentarily and the application to crash.
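
For a sense of scale, the OOM-killer line quoted earlier in the thread can be converted to more readable units. This is only a sketch: `parse_oom_kill` and `kib_to_gib` are hypothetical helper names, and it assumes the kernel's "kB" figures are units of 1024 bytes.

```python
import re

# Matches the "Out of memory: Killed process ..." line format seen in the log excerpt.
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def kib_to_gib(kib):
    """Convert KiB to GiB (assumes the kernel's 'kB' means 1024 bytes)."""
    return kib / (1024 * 1024)


def parse_oom_kill(line):
    """Return the killed process and its memory figures, or None if no match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "total_vm_gib": kib_to_gib(int(match.group("total_vm"))),
        "anon_rss_gib": kib_to_gib(int(match.group("anon_rss"))),
    }


# The OOM line from the log excerpt in this thread.
line = (
    "[ 7146.204589] Out of memory: Killed process 13081 (python) "
    "total-vm:121117196kB, anon-rss:112920660kB, file-rss:0kB, "
    "shmem-rss:4kB, UID:1000 pgtables:223272kB oom_score_adj:0"
)
info = parse_oom_kill(line)
print(f"{info['name']} (pid {info['pid']}): {info['anon_rss_gib']:.1f} GiB resident")
```

By that conversion, the killed python process was holding a little over 100 GiB of resident memory, which fits the leak described above.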
Poor wording on my part. I was referring to the NVIDIA System Management Interface (i.e. the card does not show up in nvidia-smi in the terminal).
The card seems to be working normally right now. It wasn’t a couple of days ago. I ‘fixed’ it simply by rebooting the PC. But this is odd: the intermittent failure has been recurring for the last few months.
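
Since the failure comes and goes between boots, one option is a small check that can be run after each boot to see whether both cards are visible. A minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names and the expected count of 2 (taken from the two-card setup described in the first post) are assumptions of this example.

```python
import subprocess

EXPECTED_GPUS = 2  # assumption: the two RTX 2080 cards from the first post


def parse_gpu_list(csv_text):
    """Parse 'index, name, pci.bus_id' rows from nvidia-smi CSV output."""
    gpus = []
    for row in csv_text.strip().splitlines():
        fields = [field.strip() for field in row.split(",")]
        if len(fields) == 3:
            gpus.append({
                "index": int(fields[0]),
                "name": fields[1],
                "bus_id": fields[2],
            })
    return gpus


def query_gpus():
    """Ask nvidia-smi which GPUs the driver currently sees."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_list(result.stdout)


def missing_gpu_count(gpus, expected=EXPECTED_GPUS):
    """Number of expected GPUs that did not show up."""
    return max(0, expected - len(gpus))


if __name__ == "__main__":
    try:
        gpus = query_gpus()
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"nvidia-smi failed: {exc}")
    else:
        for gpu in gpus:
            print(f"GPU {gpu['index']}: {gpu['name']} at {gpu['bus_id']}")
        missing = missing_gpu_count(gpus)
        if missing:
            print(f"WARNING: {missing} of {EXPECTED_GPUS} expected GPUs not visible")
```

If the check reports a missing card, that would be the moment to run nvidia-bug-report.sh as root, before rebooting.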

I guess the XID in the logs has nothing to do with the issue you’re facing. Please create a new nvidia-bug-report.log the next time you run into it.

Ok, I’ll try to reproduce the error and post back.