RTX 2080 Super, drivers not working (Ubuntu 20.04 or 22.04)

I can’t get the NVIDIA drivers to work on Ubuntu 22.04 (or on 20.04 before that). I’ve tried pretty much every available driver version (470, 510, 525, 535, 545, …). The drivers install fine, the modules load, and nvidia-smi works, but X doesn’t start.

$ sudo X :0
Fatal server error:
[   924.220] (EE) NVIDIA: A GPU exception occurred during X server initialization
[   924.220] (EE)
[   924.220] (EE)

nvidia-bug-report.log.gz (473.9 KB)
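
For reference, the log was generated with NVIDIA’s standard collection script, which writes nvidia-bug-report.log.gz into the current directory:

$ sudo nvidia-bug-report.sh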

The problems started when I ran a do-release-upgrade from 20.04 to 22.04, but I could not get back to a working configuration. Finally, I did a fresh install of Ubuntu 20.04 LTS, but this did not help.

In the current configuration, I’m on the 5.15 kernel (Ubuntu 22.04 LTS), but I’ve seen the same behavior on the 20.04 kernels as well. Here’s the relevant dmesg output:

$ sudo dmesg | grep -i nvidia
[    1.073599] nvidia-gpu 0000:0b:00.3: enabling device (0000 -> 0002)
[    2.429527] nvidia-gpu 0000:0b:00.3: i2c timeout error e0000000
[    5.541281] nvidia: loading out-of-tree module taints kernel.
[    5.541291] nvidia: module license 'NVIDIA' taints kernel.
[    5.590959] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[    5.597811] nvidia 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    5.617433] audit: type=1400 audit(1708689239.795:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=800 comm="apparmor_parser"
[    5.617439] audit: type=1400 audit(1708689239.795:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=800 comm="apparmor_parser"
[    5.643667] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input6
[    5.643703] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input7
[    5.643745] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input8
[    5.643776] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input9
[    5.643804] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input10
[    5.643841] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input11
[    5.643877] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input12
[    5.644638] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.147.05  Wed Oct 25 20:27:35 UTC 2023
[    5.656984] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.147.05  Wed Oct 25 20:21:31 UTC 2023
[    5.674943] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[   11.938134] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0b:00.0 on minor 0
[   12.108434] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[   12.111661] nvidia-uvm: Loaded the UVM driver, major device number 508.
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0B:00.0  On |                  N/A |
| 24%   64C    P0    78W / 250W |      1MiB /  8192MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

You’re getting Xid 109+8 on every Xserver start, which might point to a hardware issue. Furthermore, nvidia-smi is only reporting an x8 PCIe link where it should be x16. Please try reseating the GPU in its slot.
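
If you want to double-check both of those yourself, Xid events show up in the kernel log and the link width can be queried directly, e.g.:

$ sudo dmesg | grep -i xid
$ nvidia-smi --query-gpu=pcie.link.width.current,pcie.link.width.max --format=csv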

Thanks.

I removed the card completely from its slot and reinstalled it. No change, unfortunately.

nvidia-smi -q reports a current link width of x8 with a max of x16. That’s probably due to throttling at low workloads, correct?

Is there a way to test the hardware and/or upgrade firmware?

What does Xid 109+8 mean?

Xid:
https://docs.nvidia.com/deploy/xid-errors/index.html
x8 is only used if the mainboard slot itself only supports x8; it’s not related to workload. Since it’s a prebuilt Alienware gaming desktop, I would expect it to use x16.
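
You can cross-check what the slot itself negotiates with lspci (the 0b:00.0 address is taken from your dmesg output; LnkCap is the maximum, LnkSta the current state):

$ sudo lspci -s 0b:00.0 -vv | grep -i lnk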

Apparently x8 is a thing:
https://www.dell.com/community/en/conversations/alienware-desktops/aurora-r10-why-x8-pci-slot-for-graphics-card/647f897ef4ccf8a8de91d063

I’m at the end of my ideas, so I’m very happy to hear new ones. 🙂

So x8 is unexpected but correct.
You could try running gpu-burn for 10 minutes to check for a faulty GPU, or put the card into a different computer to test.
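
gpu-burn is built from source; the commonly used repo is wilicc/gpu-burn (it needs the CUDA toolkit to compile):

$ git clone https://github.com/wilicc/gpu-burn
$ cd gpu-burn
$ make
$ ./gpu_burn 600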

Ok, thanks for the continued input.

I don’t have another computer at hand to install the card, so I tried the gpu-burn route.

Note: I needed to install CUDA to compile gpu-burn. I used CUDA 11.8 because I eventually want to use a program that does not support CUDA 12. After adding the repo, I also directly installed the drivers with apt install cuda-drivers, which gave me version 550. No change in behavior overall, but I wanted to mention it because the driver version has changed since the first post.
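
For completeness, the repo setup followed NVIDIA’s standard network install for Ubuntu 22.04, roughly like this (the exact keyring filename may differ):

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
$ sudo apt update
$ sudo apt install cuda-toolkit-11-8 cuda-drivers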

gpu-burn runs for 10 minutes and exits OK:

$ ./Software/gpu-burn/gpu_burn -d 600
Using compare file: compare.ptx
Burning for 600 seconds.
GPU 0: NVIDIA GeForce RTX 2080 SUPER (UUID: GPU-4ce53796-e623-25d5-967c-2ccbb1fa53ad)

Killing processes with SIGTERM (soft kill)
done

Tested 1 GPUs:
	GPU 0: OK

I created a new nvidia-bug-report.log.gz (444.5 KB) while gpu-burn was running.

So I also tried running a PyTorch example from pytorch/examples. It hangs right at the start of the program:

$ python3 main.py --cuda --epochs 6 --log-interval 1

Meanwhile, nvidia-smi shows that only 10 MiB of memory is allocated on the GPU. Could this be getting closer to the root of the problem?

$ nvidia-smi
Tue Feb 27 10:54:42 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 ...    Off |   00000000:0B:00.0  On |                  N/A |
| 24%   64C    P0             80W /  250W |      13MiB /   8192MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     13285      C   python3                                        10MiB |
+-----------------------------------------------------------------------------------------+
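
A one-line allocation test might be a quicker way to reproduce this than the full example (assuming a CUDA-enabled PyTorch build):

$ python3 -c "import torch; print(torch.cuda.is_available()); x = torch.ones(1024, device='cuda'); print(x.sum())"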

Note that the same code runs fine on the CPU:

$ python3 main.py --epochs 1 --log-interval 1
WARNING: You have a CUDA device, so you should probably run with --cuda.
| epoch   1 |     1/ 2983 batches | lr 20.00 | ms/batch 390.24 | loss 20.46 | ppl 765026818.28
| epoch   1 |     2/ 2983 batches | lr 20.00 | ms/batch 188.19 | loss  9.48 | ppl 13130.37
| epoch   1 |     3/ 2983 batches | lr 20.00 | ms/batch 188.50 | loss  9.82 | ppl 18409.33
| epoch   1 |     4/ 2983 batches | lr 20.00 | ms/batch 187.59 | loss  9.21 | ppl  9964.66
| epoch   1 |     5/ 2983 batches | lr 20.00 | ms/batch 188.47 | loss  9.56 | ppl 14252.84
...

So I tried another gpu-burn run, explicitly specifying the memory size (-m 1000 uses 1000 MB; -d selects double precision):

$ ./Software/gpu-burn/gpu_burn -d -m 1000 10
Using compare file: compare.ptx
Burning for 10 seconds.
GPU 0: NVIDIA GeForce RTX 2080 SUPER (UUID: GPU-4ce53796-e623-25d5-967c-2ccbb1fa53ad)

Killing processes with SIGTERM (soft kill)
done

Tested 1 GPUs:
	GPU 0: OK

This reported OK too, but nvidia-smi still showed only 10 MiB of memory in use:

$ nvidia-smi
Tue Feb 27 10:57:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 ...    Off |   00000000:0B:00.0  On |                  N/A |
| 24%   64C    P0             80W /  250W |      13MiB /   8192MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     13332      C   ./Software/gpu-burn/gpu_burn                   10MiB |
+-----------------------------------------------------------------------------------------+
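
The memory numbers can also be read out directly, which is handier for watching them over time:

$ nvidia-smi --query-gpu=memory.used,memory.total --format=csv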

The GPU core seems to be fine; you could also test the video memory:
https://github.com/GpuZelenograd/memtest_vulkan
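
Prebuilt Linux binaries are on the project’s releases page; roughly (the exact archive name depends on the release):

$ tar xf memtest_vulkan-*_DesktopLinux_X86_64.tar.xz
$ ./memtest_vulkan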

Well, this doesn’t seem to be working:

$ ./memtest_vulkan
https://github.com/GpuZelenograd/memtest_vulkan v0.4.0 by GpuZelenograd
To finish testing use Ctrl+C

1: Bus=0x0B:00 DevId=0x1E81   8GB NVIDIA GeForce RTX 2080 SUPER
2: Bus=0x00:00 DevId=0x0000   32GB llvmpipe (LLVM 15.0.7, 256 bits)
(first device will be autoselected in 0 seconds)   Override index to test:
    ...first device autoselected
Runtime error: a Vulkan function returned a negative `Result` value
Runtime error: a Vulkan function returned a negative `Result` value
Runtime error: a Vulkan function returned a negative `Result` value
Runtime error: a Vulkan function returned a negative `Result` value
Runtime error: a Vulkan function returned a negative `Result` value
Runtime error: a Vulkan function returned a negative `Result` value
^C
memtest_vulkan: INIT OR FIRST testing failed due to runtime error
  press any key to continue...

gpu-burn didn’t really run; it hung on start and was killed with Xid 8. The memory test also errored out, so I strongly suspect the GPU is broken.
All that’s left is doing a clean reinstall of the OS and checking whether anything changes.

Argh, OK, thanks. This is already a fresh install of Ubuntu. I’m considering an install of Windows as a final test.

I presume there is no real path forward to repair a broken card, correct?

Not really.

Just to follow-up:

I tried installing Windows 11. The installer ran, but about 5 minutes into the configuration the display went dark, and I couldn’t get it back.

I got a replacement RTX 4060, which is working great so far :)

Now what to do with the defective RTX 2080 SUPER… anyone have any ideas?

Thanks for the handholding throughout this!