After installing the latest CUDA packages on my Ubuntu system (kernel 5.15.0-48-generic), the graphics no longer work as they should, and the card has become useless as a computing tool.
Symptoms:
Black screen; seems to have problems with mode setting.
Very slow nvidia-smi (it does eventually give a result, but only after more than 10 s):
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 49%   63C    P0   121W / 350W |     17MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Also note the high power draw even with no running processes!
Xorg also seems to be spinning at full throttle without producing anything useful:
inxi -t c
Processes:
CPU top: 5 of 356
1: cpu: 97.7% command: xorg pid: 4279
If I remove all the NVIDIA/CUDA packages and then install nvidia-driver-510-server (driver version 510.85.02), the graphics come back.
However, TensorFlow (2.11.0-dev20221005) then fails to use the GPU due to "Could not load dynamic library 'libnvinfer.so.7'", and I am not able to pull together all the needed libraries to get it back into a working state without apt dragging in the new NVIDIA drivers again.
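For anyone debugging the same thing, here is a generic check (not specific to this setup) of what TensorFlow actually sees and whether a libnvinfer library is resolvable at all:

# List the GPUs TensorFlow can see; missing libraries such as libnvinfer.so.7
# are reported as warnings during import.
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Check whether a libnvinfer (TensorRT) library is registered with the dynamic linker:
ldconfig -p | grep libnvinfer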
I have attached a bug report (which takes forever to generate, again probably because the mode set is slow or times out before continuing): nvidia-bug-report.log.gz (440.4 KB)
In preparation for testing the next-generation RTX 4090, I upgraded to CUDA 11.8 on two workstations, with a similar FATAL RESULT:
Black display, seen both with Ubuntu 20.04 and an RTX 3090 as well as with 22.04 and an RTX 3080 Ti.
Both systems are running kernel 5.15.0-48-generic.
The workstation with Ubuntu 22.04 sometimes rejects SSH connections; top shows 100% load from an NVIDIA process, then from Xorg, and two minutes later from plymouthd.
Thanks, but still no success.
I just tried switching from HDMI to DisplayPort on both systems (20.04 and 22.04): still a black display, and via SSH, top shows Xorg at 100% load even 10 minutes after reboot.
Just wanted to add a "me too". This is on a clean, fresh Ubuntu 22.04 install with an RTX 3090. CUDA 11.7 / 515.65.01 works perfectly. CUDA 11.8 / 520 fails to boot as described in your post.
amrits, can you describe the workaround so we can install 11.8? The current deb install isn't just unusable, it makes systems unbootable. It seems like this should be a high-priority hotfix, with a workaround procedure in the meantime.
Hey there, my workaround is to install NVIDIA driver 520.56 first. When installing CUDA 11.8, follow every step but change the very last step to sudo apt-get install nvidia-cuda-toolkit. This does not erase your local driver and prevents the driver crash.
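In case it is useful, here is a rough sketch of that sequence as shell commands; the package names (nvidia-driver-520, nvidia-cuda-toolkit) are assumptions based on the usual Ubuntu naming, so check what your repositories actually provide:

# Install the 520-series display driver on its own first
sudo apt-get update
sudo apt-get install nvidia-driver-520
# Follow the CUDA 11.8 instructions up to the final step, then install the
# toolkit package instead of the "cuda" meta-package so apt does not replace the driver
sudo apt-get install nvidia-cuda-toolkit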
The other way is to use nvidia-docker, released by NVIDIA. In this case you don't need to install CUDA, only the NVIDIA driver. The PyTorch Docker image, which includes CUDA, can be found at PyTorch | NVIDIA NGC.
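For reference, a minimal sketch of that route, assuming the NVIDIA Container Toolkit is already set up on the host; the image tag below is only an example placeholder, pick a current one from the NGC page:

# Only the NVIDIA driver is needed on the host; CUDA ships inside the image
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.10-py3
# Inside the container, a quick check that the GPU is visible to PyTorch
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"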
Hi NVIDIA, don't you think it is time to fix this? What are you waiting for? My two systems are completely unusable, and I don't want to waste more time with workarounds…
@amrits After 7 years of successful CUDA use on a couple of machines at our site, this is now the worst experience ever. And NVIDIA is 100% responsible for this mess.
REMINDER: You announced on the 10th of October that the issue has been root-caused and the fix is integrated into a future release driver.
Why don't you just build it and release it?
That works fine, but it is only the driver, not CUDA. Also, for the 40 series the workaround of installing CUDA 11.7 doesn't work, because we need 11.8 or higher for Ada GPUs, so I guess we are stuck until CUDA 12.0 is released next year.
To run PyTorch models I had to create a container (CUDA 11.8), and with it I can use my 4090 GPUs for training or inference. Keep in mind that since these GPUs are new and have a new SM architecture, you may need to recompile PyTorch or other packages to support them, or the code won't run. I hope NVIDIA hurries up and releases CUDA 12 so all of this is solved.
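As a rough sanity check before starting a long job (generic PyTorch calls, nothing specific to this container), you can verify whether the installed PyTorch binary was actually built for the Ada architecture:

# Compute capability of the first GPU; an RTX 4090 reports (8, 9), i.e. sm_89
python3 -c "import torch; print(torch.cuda.get_device_capability(0))"
# Architectures the installed PyTorch binary was compiled for; if sm_89 (or a
# compatible PTX fallback) is missing, kernels will not run on the 4090
python3 -c "import torch; print(torch.cuda.get_arch_list())"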
I also want to point out that we are experiencing restarts during long training sessions with Ubuntu 22.04 and the new drivers; this is not happening with 20.04.