RTX 2080 Super, drivers not working (Ubuntu 20.04 or 22.04)

I can’t get the NVIDIA drivers to work on Ubuntu 22.04 (or on 20.04 before that). I’ve tried pretty much every available driver version (470, 510, 525, 535, 545, …). The drivers install fine, the modules load, and nvidia-smi works, but X doesn’t start.

$ sudo X :0
Fatal server error:
[   924.220] (EE) NVIDIA: A GPU exception occurred during X server initialization
[   924.220] (EE)
[   924.220] (EE)

nvidia-bug-report.log.gz (473.9 KB)
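
For reference, the log was generated with NVIDIA’s standard collection script, which writes nvidia-bug-report.log.gz into the current directory:

$ sudo nvidia-bug-report.sh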

The problems started when I ran a do-release-upgrade from 20.04 to 22.04, but I could not get back to a working configuration. Finally, I did a fresh install of Ubuntu 20.04 LTS, but this did not help.

In the current configuration, I’m on the 5.15 kernel (Ubuntu 22.04 LTS), but I’ve seen the same behavior on the 20.04 kernels as well. Here’s the relevant dmesg output:

$ sudo dmesg | grep -i nvidia
[    1.073599] nvidia-gpu 0000:0b:00.3: enabling device (0000 -> 0002)
[    2.429527] nvidia-gpu 0000:0b:00.3: i2c timeout error e0000000
[    5.541281] nvidia: loading out-of-tree module taints kernel.
[    5.541291] nvidia: module license 'NVIDIA' taints kernel.
[    5.590959] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[    5.597811] nvidia 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    5.617433] audit: type=1400 audit(1708689239.795:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=800 comm="apparmor_parser"
[    5.617439] audit: type=1400 audit(1708689239.795:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=800 comm="apparmor_parser"
[    5.643667] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input6
[    5.643703] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input7
[    5.643745] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input8
[    5.643776] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input9
[    5.643804] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input10
[    5.643841] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input11
[    5.643877] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:03.1/0000:0b:00.1/sound/card0/input12
[    5.644638] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.147.05  Wed Oct 25 20:27:35 UTC 2023
[    5.656984] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.147.05  Wed Oct 25 20:21:31 UTC 2023
[    5.674943] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[   11.938134] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0b:00.0 on minor 0
[   12.108434] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[   12.111661] nvidia-uvm: Loaded the UVM driver, major device number 508.
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0B:00.0  On |                  N/A |
| 24%   64C    P0    78W / 250W |      1MiB /  8192MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

You’re getting Xid 109+8 on every Xserver start, which might point to a hardware issue. Furthermore, nvidia-smi is only reporting an x8 PCIe link where it should be x16. Please try reseating the GPU in its slot.
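
If you want to double-check both of those yourself, Xid events show up in the kernel log and the link width can be queried directly, e.g.:

$ sudo dmesg | grep -i xid
$ nvidia-smi --query-gpu=pcie.link.width.current,pcie.link.width.max --format=csv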

Thanks.

I removed the card completely from its slot and reinstalled it. No change, unfortunately.

nvidia-smi -q reports a current link width of x8 with a max of x16. That’s probably due to throttling at low workloads, correct?

Is there a way to test the hardware and/or upgrade firmware?

What does Xid 109+8 mean?

Xid:
https://docs.nvidia.com/deploy/xid-errors/index.html
x8 is only used if the mainboard slot itself only supports x8; it’s not related to workload. Since it’s a prebuilt Alienware gaming desktop, I would expect it to use x16.
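
You can cross-check what the slot itself negotiates with lspci (the 0b:00.0 address is taken from your dmesg output; LnkCap is the maximum, LnkSta the current state):

$ sudo lspci -s 0b:00.0 -vv | grep -i lnk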

Apparently x8 is a thing:
https://www.dell.com/community/en/conversations/alienware-desktops/aurora-r10-why-x8-pci-slot-for-graphics-card/647f897ef4ccf8a8de91d063

I’m at the end of my ideas, so I’m very happy to hear new ones. 🙂

So x8 is unexpected but correct.
You could try running gpu-burn for 10 minutes to check for a faulty GPU, or put the card into a different computer to test.
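
gpu-burn is built from source; the commonly used repo is wilicc/gpu-burn (it needs the CUDA toolkit to compile):

$ git clone https://github.com/wilicc/gpu-burn
$ cd gpu-burn
$ make
$ ./gpu_burn 600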

Ok, thanks for the continued input.

I don’t have another computer at hand to install the card, so I tried the gpu-burn route.

Note: I needed to install CUDA to compile gpu-burn. I used CUDA 11.8 because I eventually want to use a program that does not support CUDA 12. After adding the repo, I also directly installed the drivers with apt install cuda-drivers, which gave me version 550. No change in behavior overall, but I wanted to mention it because the driver version has changed since the first post.
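
For completeness, the repo setup followed NVIDIA’s standard network install for Ubuntu 22.04, roughly like this (the exact keyring filename may differ):

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
$ sudo apt update
$ sudo apt install cuda-toolkit-11-8 cuda-drivers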

gpu-burn runs for 10 minutes and exits OK:

$ ./Software/gpu-burn/gpu_burn -d 600
Using compare file: compare.ptx
Burning for 600 seconds.
GPU 0: NVIDIA GeForce RTX 2080 SUPER (UUID: GPU-4ce53796-e623-25d5-967c-2ccbb1fa53ad)

Killing processes with SIGTERM (soft kill)
done

Tested 1 GPUs:
	GPU 0: OK

I created a new nvidia-bug-report.log.gz (444.5 KB) while gpu-burn was running.

So I also tried running a PyTorch example from pytorch/examples. It hangs right at the start of the program:

$ python3 main.py --cuda --epochs 6 --log-interval 1

Meanwhile, nvidia-smi shows that only 10 MiB of memory is allocated on the GPU. Could this be getting closer to the root of the problem?

$ nvidia-smi
Tue Feb 27 10:54:42 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 ...    Off |   00000000:0B:00.0  On |                  N/A |
| 24%   64C    P0             80W /  250W |      13MiB /   8192MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     13285      C   python3                                        10MiB |
+-----------------------------------------------------------------------------------------+
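
A one-line allocation test might be a quicker way to reproduce this than the full example (assuming a CUDA-enabled PyTorch build):

$ python3 -c "import torch; print(torch.cuda.is_available()); x = torch.ones(1024, device='cuda'); print(x.sum())"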

Note that the same code runs fine on the CPU:

$ python3 main.py --epochs 1 --log-interval 1
WARNING: You have a CUDA device, so you should probably run with --cuda.
| epoch   1 |     1/ 2983 batches | lr 20.00 | ms/batch 390.24 | loss 20.46 | ppl 765026818.28
| epoch   1 |     2/ 2983 batches | lr 20.00 | ms/batch 188.19 | loss  9.48 | ppl 13130.37
| epoch   1 |     3/ 2983 batches | lr 20.00 | ms/batch 188.50 | loss  9.82 | ppl 18409.33
| epoch   1 |     4/ 2983 batches | lr 20.00 | ms/batch 187.59 | loss  9.21 | ppl  9964.66
| epoch   1 |     5/ 2983 batches | lr 20.00 | ms/batch 188.47 | loss  9.56 | ppl 14252.84
...

So I tried another gpu-burn run, explicitly specifying the memory size (-m 1000 uses 1000 MB; -d selects double precision):

$ ./Software/gpu-burn/gpu_burn -d -m 1000 10
Using compare file: compare.ptx
Burning for 10 seconds.
GPU 0: NVIDIA GeForce RTX 2080 SUPER (UUID: GPU-4ce53796-e623-25d5-967c-2ccbb1fa53ad)

Killing processes with SIGTERM (soft kill)
done

Tested 1 GPUs:
	GPU 0: OK

This reported OK too, but nvidia-smi still showed only 10 MiB of memory in use:

$ nvidia-smi
Tue Feb 27 10:57:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 ...    Off |   00000000:0B:00.0  On |                  N/A |
| 24%   64C    P0             80W /  250W |      13MiB /   8192MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     13332      C   ./Software/gpu-burn/gpu_burn                   10MiB |
+-----------------------------------------------------------------------------------------+
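
The memory numbers can also be read out directly, which is handier for watching them over time:

$ nvidia-smi --query-gpu=memory.used,memory.total --format=csv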

The GPU core seems to be fine; you could also test the video memory:
https://github.com/GpuZelenograd/memtest_vulkan
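
Prebuilt Linux binaries are on the project’s releases page; roughly (the exact archive name depends on the release):

$ tar xf memtest_vulkan-*_DesktopLinux_X86_64.tar.xz
$ ./memtest_vulkan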

Well, this doesn’t seem to be working:

$ ./memtest_vulkan
https://github.com/GpuZelenograd/memtest_vulkan v0.4.0 by GpuZelenograd
To finish testing use Ctrl+C

1: Bus=0x0B:00 DevId=0x1E81   8GB NVIDIA GeForce RTX 2080 SUPER
2: Bus=0x00:00 DevId=0x0000   32GB llvmpipe (LLVM 15.0.7, 256 bits)
(first device will be autoselected in 0 seconds)   Override index to test:
    ...first device autoselected
Runtime error: a Vulkan function returned a negative `Result` value
Runtime error: a Vulkan function returned a negative `Result` value
Runtime error: a Vulkan function returned a negative `Result` value
Runtime error: a Vulkan function returned a negative `Result` value
Runtime error: a Vulkan function returned a negative `Result` value
Runtime error: a Vulkan function returned a negative `Result` value
^C
memtest_vulkan: INIT OR FIRST testing failed due to runtime error
  press any key to continue...

gpu-burn didn’t really run; it hung on start and was killed with Xid 8. The memory test also errored out, so I strongly suspect the GPU is broken.
All that’s left is doing a clean reinstall of the OS and checking whether anything changes.

Argh, OK, thanks. This is already a fresh install of Ubuntu. I’m considering an install of Windows as a final test.

I presume there is no real path forward to repair a broken card, correct?

Not really.

Just to follow-up:

I tried installing Windows 11. The installer ran, but about 5 minutes into the configuration the display went dark, and I couldn’t get it back.

I got a replacement RTX 4060, which is working great so far :)

Now what to do with the defective RTX 2080 SUPER… anyone have any ideas?

Thanks for the handholding throughout this!