Driver 455/460 loads in Ubuntu, but the boot sequence hangs with a purple screen

Build specs:

  • ROG CROSSHAIR VIII DARK HERO
  • Ryzen 5950X
  • RTX 3090 (Gigabyte Master)
  • 64 GB RAM
  • EVGA 1000 W PSU
  • Ubuntu 20.04

After installing either driver 455 or 460, booting into Ubuntu always ends with a purple screen. I can still get to a terminal with Ctrl+Alt+F2, and I have to run apt purge nvidia* to get Ubuntu to boot successfully again.
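For reference, the recovery from the Ctrl+Alt+F2 console looks roughly like this (just a sketch; the exact set of packages on your install may differ):

$ sudo apt purge 'nvidia*'    # remove all installed NVIDIA driver packages
$ sudo apt autoremove         # clean up now-unneeded dependencies
$ sudo reboot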

$ dkms status
nvidia, 455.45.01, 5.8.0-45-generic, x86_64: installed

$ uname -a
Linux badboy3 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

(Currently on 455, but the following output is from when driver 460 was installed earlier.)

$ nvidia-smi
Tue Mar 16 20:41:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:09:00.0  On |                  N/A |
|  0%   43C    P8    22W / 370W |     77MiB / 24259MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1202      G   /usr/lib/xorg/Xorg                 61MiB |
|    0   N/A  N/A      1469      G   /usr/bin/gnome-shell               14MiB |
+-----------------------------------------------------------------------------+

I have blacklisted the nouveau driver. Here is the output from Xorg.0.log (Pastebin): [ 9.950] (--) Log file renamed from "/var/log/Xorg.pid-1202.log" to "/var/lo …
Output from dmesg (Pastebin): [Tue Mar 16 20:41:17 2021] nvidia: loading out-of-tree module taints kernel. [T …
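In case it matters, the blacklisting was done the standard modprobe way, roughly like this (the file name is arbitrary):

$ cat /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
$ sudo update-initramfs -u   # rebuild the initramfs so the blacklist applies at boot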

I am not sure what causes this issue or where to look for errors.

Please run nvidia-bug-report.sh as root and attach the resulting file here.

Here are the logs for both drivers 455 and 460:

nvidia-bug-report455.ppa.after_reboot.log.gz (302.6 KB) nvidia-bug-report460.ppa.after_reboot.log.gz (288.0 KB)

→ Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up. (Answer: Yes)

At some point you installed the driver with the .run file installer, which created an xorg.conf.
Please remove it if it’s still around (/etc/X11/xorg.conf).
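A quick way to check and clean up (standard paths; nvidia-uninstall is only present if a .run installation is still around):

$ ls -l /etc/X11/xorg.conf     # check whether a leftover config file exists
$ sudo rm /etc/X11/xorg.conf   # remove it if present
$ sudo nvidia-uninstall        # optional: removes a previous .run installation, if any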

[ 0.896283] … node #0, CPUs: #1
[ 0.500214] __common_interrupt: 1.55 No irq handler for vector

[ 5.901976] EDAC amd64: Error: F0 not found, device 0x1650 (broken BIOS?)

Also, your kernel does not fully support your CPU; 5.10/5.11 might work better (not that I think that’s directly related to your graphics problem).

You boot with 2 displays connected (the 2nd, non-boot one being your TV); you might also try disconnecting the TV (just for debugging).
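One way to check which outputs X actually considers connected and which one is primary (assuming X is running on display :0):

$ DISPLAY=:0 xrandr | grep -w connected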

Thank you for your reply.

I forgot to check the basic things (like cable connections) before jumping to the conclusion that something was not working. Silly me.

I think the driver is working (yes, I uninstalled the .run driver before running the test). The card is connected to 2 displays; the TV was powered off and I only turned on the 2nd monitor, so I assumed the card would ignore the TV and send its output to the monitor instead. I didn’t realize the main output was going to the TV the entire time - I was seeing the purple screen because the card was treating the monitor as a secondary display. Thank you for pointing this out. After unplugging the HDMI cable from the TV, everything seems to be working. I am now trying to run some machine learning benchmark tests. Just to be sure: do you see any outstanding issues in the log files? Thanks again.

Nothing more than I already mentioned.

Looks like I ran into an issue with the card when running some benchmark code.

The card crashed the entire system, with the fans going nuts. See the video I posted below:

As you mentioned, you don’t see anything in the logs, so I am wondering if there is any test I can run in Ubuntu to see whether the card is faulty? This is a clean Ubuntu install; I only installed the driver, CUDA and PyTorch (plus, I guess, their dependency packages), nothing else.

Can you ssh into the machine after the crash and create a bug report?
Otherwise, create a new regular report, also run journalctl -b-1 > journal.txt, and attach that.
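In other words, roughly (user/hostname are placeholders):

$ ssh user@badboy3               # from another machine, after the crash
$ sudo nvidia-bug-report.sh      # writes nvidia-bug-report.log.gz in the current directory
$ journalctl -b-1 > journal.txt  # log of the previous (crashed) boot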

I really recommend installing kernel 5.10.
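One way on Ubuntu is the mainline kernel builds (only one option among several; the exact .deb file names change per build):

$ mkdir kernel-5.10 && cd kernel-5.10
$ # download the amd64 linux-headers, linux-image-unsigned and linux-modules .deb files
$ # for the desired version from https://kernel.ubuntu.com/~kernel-ppa/mainline/
$ sudo dpkg -i *.deb
$ sudo reboot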

You could try cuda-memcheck:
https://docs.nvidia.com/cuda/cuda-memcheck/index.html

I have installed 5.10 and still experience the same issue with the card when running the benchmark code. The code hangs and the display signal goes out, but I could still ssh into the system from a terminal on my Mac and create a bug report.

$ cuda-memcheck --log-file pytorch_test.log --save output.txt ./test.sh
start
benchmark start : 2021/03/18 10:24:03
Number of GPUs on current device : 1
CUDA Version : 11.1
Cudnn Version : 8005
Device Name : GeForce RTX 3090
uname_result(system='Linux', node='badboy3', release='5.10.0-051000-generic', version='#202012132330 SMP Sun Dec 13 23:33:36 UTC 2020', machine='x86_64', processor='x86_64')
                     scpufreq(current=3850.1365937499995, min=2200.0, max=3400.0)
                    cpu_count: 32
                    memory_available: 65369694208
Benchmarking Training float precision type mnasnet0_5
mnasnet0_5 model average train time : 16.601696014404297ms
Benchmarking Training float precision type mnasnet0_75

I have attached all the log files here.

nvidia-bug-report.log.gz (177.3 KB) journal.txt (514.9 KB) pytorch_test.log (4 KB)

[ 486.776778] NVRM: GPU at PCI:0000:09:00: GPU-a1309680-6633-0c45-f563-b583b17b5b57
[ 486.776781] NVRM: GPU Board Serial Number:
[ 486.776783] NVRM: Xid (PCI:0000:09:00): 79, pid=0, GPU has fallen off the bus.
[ 486.776785] NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
[ 486.776786] NVRM: GPU 0000:09:00.0: GPU is on Board .
[ 486.776799] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.

That’s the log output showing what is happening.
According to:
https://docs.nvidia.com/deploy/xid-errors/index.html

The reason can be hardware, driver, or overheating.
Please check your cooling. If you can rule that out, there is a crash dump in the bug report that the NVIDIA people could analyze.
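If you want to watch the card while the benchmark runs, logging temperature, power and clocks from a second terminal is an easy way to rule overheating in or out, e.g.:

$ nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm --format=csv -l 1 > gpu_monitor.csv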
Beyond that, as far as I can see, there’s nothing more I can do.


I’d rather guess this is a problem with the PSU; please see this thread for some ideas:
https://forums.developer.nvidia.com/t/pc-restarting-when-trainning-dl-model/168941
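One of the mitigations discussed there is capping the GPU clocks/power draw with nvidia-smi to avoid transient load spikes, roughly like this (the numbers are only examples, adjust them for your card):

$ sudo nvidia-smi -pm 1           # enable persistence mode
$ sudo nvidia-smi -lgc 210,1700   # lock the graphics clock to a range (MHz)
$ sudo nvidia-smi -pl 300         # lower the board power limit (W)
$ sudo nvidia-smi -rgc            # later: remove the clock lock again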

Yes. Setting the GPU clock limits let me pass the benchmark code. Any pointers on where to look for potential issues with my PSU?
https://www.evga.com/products/product.aspx?pn=210-GQ-1000-V1

Just want to nail down the issue here before I choose to RMA the card.

I guess there’s nothing wrong with the GPU itself; it’s just that the specific combination of your 3090 model and PSU is problematic. What kind of power connectors does that card have (12-pin/8-pin)? Is an adapter necessary to connect it to the PSU?

The card has 2x 8-pin connectors and doesn’t need any adapter.

Then maybe try using different terminals on the PSU.

Thanks for the suggestion. I tried a few different VGA terminals on the PSU, but I don’t think it made any difference - it still crashes on the benchmark code. Maybe I will look for the offending line in the code that led to the crash.
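One thing I might try to pinpoint it: running the benchmark with synchronous kernel launches, so CUDA errors surface at the offending call instead of later (assuming test.sh launches the PyTorch script directly):

$ CUDA_LAUNCH_BLOCKING=1 ./test.sh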