NVidia driver 520.61.05 / Cuda 11.8 / RTX 3090 = black display and superslow modesets

After installing the latest CUDA packages on my Ubuntu system (kernel 5.15.0-48-generic), the graphics no longer work as they should, and the card has become useless as a computing tool.
Symptoms:

  • Black screen; seems to have problems with mode setting
    part of dmesg:
[   31.888468] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1120
[   39.815112] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:0:0:1129
[   47.762125] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:1:0:1129
[   87.802420] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device AUS VG27B (HDMI-0)
  • Superslow nvidia-smi (actually gives a result but only after more than 10 s)
    nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 49%   63C    P0   121W / 350W |     17MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Also note the high power draw despite there being no running processes!

  • Xorg also seems to be spinning at full throttle without producing anything useful:
    inxi -t c
    Processes:
    CPU top: 5 of 356
    1: cpu: 97.7% command: xorg pid: 4279

System Info “inxi --admin --verbosity=7 --filter --no-host --width”, edited for brevity
System:
Kernel: 5.15.0-48-generic x86_64 bits: 64 compiler: gcc v: 11.2.0
parameters: BOOT_IMAGE=/boot/vmlinuz-5.15.0-48-generic
root=UUID=NNN ro quiet splash
vt.handoff=7
Console: pty pts/2 DM: GDM3 42.0
Distro: Ubuntu 22.04.1 LTS (Jammy Jellyfish)
Machine:
Type: Desktop System: ASUS product: N/A v: N/A serial:
Mobo: ASUSTeK model: ROG STRIX Z490-A GAMING v: Rev 1.xx
serial: UEFI: American Megatrends v: 2403
date: 10/27/2021
Memory:
RAM: total: 62.71 GiB used: 1.91 GiB (3.1%)
CPU:
Info: model: Intel Core i9-10900KF bits: 64 type: MT MCP arch: Comet Lake
family: 6 model-id: 0xA5 (165) stepping: 5 microcode: 0xF0
Graphics:
Device-1: NVIDIA GA102 [GeForce RTX 3090] vendor: Micro-Star MSI
driver: nvidia v: 520.61.05 alternate: nvidiafb,nouveau,nvidia_drm pcie:
gen: 3 speed: 8 GT/s lanes: 16 link-max: gen: 4 speed: 16 GT/s ports:
active: none off: HDMI-A-1 empty: DP-1,DP-2,DP-3 bus-ID: 01:00.0
chip-ID: 10de:2204 class-ID: 0300
Display: server: X.org v: 1.21.1.3 with: Xwayland v: 22.1.1 driver: X:
loaded: nouveau,vesa unloaded: fbdev,modesetting gpu: nvidia tty: 100x42

If I remove all nvidia/cuda packages and then install nvidia-driver-510-server (Driver Version: 510.85.02), the graphics come back.

However, it then fails to get Tensorflow (2.11.0-dev20221005) to use the GPU due to “Could not load dynamic library ‘libnvinfer.so.7’”, and I am not able to gather all the needed libraries to restore the working state without apt dragging in the new NVIDIA drivers again.
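One way to keep apt from pulling the 520 packages back in is to put the working driver packages on hold. A minimal sketch, assuming the 510-server metapackage name (check `dpkg -l | grep nvidia` for the exact package names on your system):

```shell
# Pin the currently working driver so apt upgrade/install
# cannot replace it with the broken 520 packages.
sudo apt-mark hold nvidia-driver-510-server

# Verify which packages are held.
apt-mark showhold

# Later, once a fixed driver is released, release the hold:
# sudo apt-mark unhold nvidia-driver-510-server
```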

I have attached a bug report (which takes forever to generate, again probably because the modeset is slow or times out before continuing).
nvidia-bug-report.log.gz (440.4 KB)


In preparation for testing the next-generation RTX 4090, I upgraded to CUDA 11.8 on two workstations, with the same FATAL RESULT:
Black display, observed both on Ubuntu 20.04 with an RTX 3090 and on 22.04 with an RTX 3080 Ti.

Both systems are running kernel 5.15.0-48-generic.

The workstation with Ubuntu 22.04 sometimes rejects SSH connections; top shows 100% load from the nvidia processes, then Xorg, and 2 minutes later plymouthd.

NVIDIA - PLEASE FIX THIS ASAP!


This is probably the same problem as Black X11 Screen and partial lockup when upgraded to 515.76 on RTX3060 .

The suggested temporary workaround of switching to DisplayPort works. And it seems we will get a fix in the next update!

Thanks, but still no success.
I just tried switching from HDMI to DisplayPort on both systems, 20.04 and 22.04: still a black display, and via SSH, top shows Xorg at 100% load even 10 minutes after reboot.

Just wanted to add “me too”. This is on a clean, fresh Ubuntu 22.04 with an RTX 3090. CUDA 11.7 / 515.65.01 works perfectly. CUDA 11.8 / 520 fails to boot as described in your post.


Hi All,
We are aware of this issue, and it has been root-caused.
The fix is integrated into a future driver release.


Same issue with an A6000

I have the same problem, and found it a little hard to revert to the old version. Here are the commands that saved me:

sudo apt-get purge 'nvidia*'
sudo apt-get install cuda=11.7.1-1 cuda-drivers=515.65.07-1 libcudnn8-dev=8.5.0.96-1+cuda11.7 libcudnn8=8.5.0.96-1+cuda11.7

I found some help in that post:

The key to finding the old version names is to use “apt list -a cuda” and “apt list -a libcudnn8”.
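For anyone searching for the exact version strings to pin, those commands list every version the configured repos offer, and apt-cache policy shows which repo each candidate comes from. A quick sketch:

```shell
# List all available versions of the cuda and libcudnn8 packages;
# the version column is what goes after "=" in apt-get install.
apt list -a cuda 2>/dev/null
apt list -a libcudnn8 2>/dev/null

# Show candidate versions together with their source repository,
# useful to confirm you are pulling from NVIDIA's repo.
apt-cache policy cuda
```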

amrits, can you describe the workaround so we can install 11.8? The current deb install isn’t just unusable, it makes systems unbootable. It seems like it should be a high-priority hotfix, with a workaround procedure in the meantime.

Hey there, my workaround is to install NVIDIA driver 520.56 first. When installing CUDA 11.8, follow every step but change the very last step to sudo apt-get install nvidia-cuda-toolkit. This does not erase your local driver, and it prevents the driver crash.
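For clarity, here is a sketch of that sequence, assuming the standard CUDA 11.8 network-repo steps for Ubuntu 22.04 (the keyring URL and package names are taken from NVIDIA's usual install instructions; verify them against the official install guide for your distro). Note that nvidia-cuda-toolkit here is Ubuntu's own toolkit package, not NVIDIA's cuda metapackage, which is exactly why it leaves the driver alone:

```shell
# Add NVIDIA's repo signing key (network-install method).
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update

# Last step changed: install only the toolkit, NOT the "cuda"
# metapackage, so the already-installed driver is left untouched.
sudo apt-get install nvidia-cuda-toolkit
```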

The other way is to use nvidia-docker, released by NVIDIA. In this case you don’t need to install CUDA, only the NVIDIA driver. The PyTorch Docker image, which includes CUDA, can be found at PyTorch | NVIDIA NGC.
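A minimal sketch of that container route, assuming the NVIDIA Container Toolkit is installed on the host (the image tag is an assumption; pick a current one from the NGC catalog):

```shell
# Pull and run the NGC PyTorch image; CUDA lives inside the image,
# so only the host driver matters. --gpus all exposes every GPU.
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.09-py3 \
    python -c "import torch; print(torch.cuda.is_available())"
```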


Hi NVIDIA, don’t you think it is time to fix this? What are you waiting for? My two systems are completely unusable, and I don’t want to waste more time on workarounds…


When is Nvidia going to fix this problem?

@evdaccs @jwkb
Please confirm if you tried with driver 520.56.06.
This driver fixed issue for users reported on another thread.
[Bug Report] Black X11 Screen and partial lockup when upgraded to 515.76 and dual RTX3060 - Graphics / Linux / Linux - NVIDIA Developer Forums

@amrits

Please confirm if you tried with driver 520.56.06.
This driver fixed issue for users reported on another thread.
[Bug Report] Black X11 Screen and partial lockup when upgraded to 515.76 and dual RTX3060 - Graphics / Linux / Linux - NVIDIA Developer Forums

Thank you for your prompt response!

It looks like you shared a workaround, which I appreciate as a way to get unblocked, but I do not think it can be considered a fix.

When is Nvidia going to fix this problem?

In other words, when can users expect to install cuda from official package sources and not have breakage?

sudo apt install cuda

Thanks again for your support!

@evdaccs Thank you very much, you are 100% right!

@amrits After 7 years of successful use of CUDA on a couple of machines at our site, this is now the worst experience ever. And NVIDIA is 100% responsible for this mess.
REMINDER: You announced on the 10th of October that the issue has been root-caused and the fix is integrated into a future release driver.
Why don’t you just build it and release it?

Please confirm if you tried with driver 520.56.06.
This driver fixed issue for users reported on another thread.

I wasn’t able to install this driver to even unblock myself, by the way.
It’s not available.

What is the expectation on users here?

Here’s Nvidia’s cuda repo (focal):
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/

Where is driver 520.56.06 ?

Okay folks,
It’s unknown when Nvidia will fix the problem.

After a few hours, I found that downgrading through their archive to a functioning toolkit works.

So here is your workaround:


wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda-repo-ubuntu2004-11-7-local_11.7.1-515.65.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-7-local_11.7.1-515.65.01-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

I was able to boot my computer again without needing to disconnect the HDMI cable during startup.

You can download the driver using the link below:
https://us.download.nvidia.com/XFree86/Linux-x86_64/520.56.06/NVIDIA-Linux-x86_64-520.56.06.run
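For anyone who hasn't used a .run installer before, a typical sequence looks like the sketch below (run it from a text console with the display manager stopped; the systemd target trick is a common convention, not an NVIDIA-specific requirement):

```shell
# Drop to a console-only target so X/Wayland releases the GPU.
sudo systemctl isolate multi-user.target

# Fetch and run the standalone installer.
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/520.56.06/NVIDIA-Linux-x86_64-520.56.06.run
sudo sh NVIDIA-Linux-x86_64-520.56.06.run
```

Keep in mind a .run install is not tracked by apt, so a later apt-managed driver package can conflict with it.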


That works fine, but it is only the driver, not CUDA. Also, for the 40 series, installing CUDA 11.7 doesn’t work because Ada GPUs need 11.8 or higher, so I guess we are stuck until CUDA 12.0 is released next year.

Is that right @amrits ?

Thanks

In order to run PyTorch models I had to create a container (CUDA 11.8); with it I can use my 4090 GPUs for training or inference. Keep in mind that since these GPUs are new and have a new SM version, you may need to recompile PyTorch or other packages to support them, or the code won’t run. I hope NVIDIA hurries up and releases CUDA 12 so all this is solved.

I also want to point out that we are experiencing restarts during long training sessions on Ubuntu 22.04 with the new drivers; this is not happening with 20.04.

Thanks