Driver 455/460 loads in Ubuntu, but the boot sequence hangs with a purple screen

Build specs:

  • ROG CROSSHAIR VIII DARK HERO
  • Ryzen 5950X
  • RTX 3090 (Gigabyte Master)
  • 64 GB RAM
  • EVGA 1000 W PSU
  • Ubuntu 20.04

After installing either driver 455 or 460, booting into Ubuntu always ends with a purple screen. I can still get to a terminal with Ctrl+Alt+F2, and I have to run apt purge nvidia* to get Ubuntu to boot successfully again.
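For reference, the recovery from the Ctrl+Alt+F2 console looks roughly like this (just a sketch; the exact set of packages on your install may differ):

$ sudo apt purge 'nvidia*'    # remove all installed NVIDIA driver packages
$ sudo apt autoremove         # clean up now-unneeded dependencies
$ sudo reboot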

$ dkms status
nvidia, 455.45.01, 5.8.0-45-generic, x86_64: installed

$ uname -a
Linux badboy3 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

(Currently on 455, but the following output is from when driver 460 was installed earlier.)

$ nvidia-smi
Tue Mar 16 20:41:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:09:00.0  On |                  N/A |
|  0%   43C    P8    22W / 370W |     77MiB / 24259MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1202      G   /usr/lib/xorg/Xorg                 61MiB |
|    0   N/A  N/A      1469      G   /usr/bin/gnome-shell               14MiB |
+-----------------------------------------------------------------------------+

I have blacklisted the nouveau driver. Here is the output from Xorg.0.log (Pastebin): [ 9.950] (--) Log file renamed from "/var/log/Xorg.pid-1202.log" to "/var/lo …
Output from dmesg (Pastebin): [Tue Mar 16 20:41:17 2021] nvidia: loading out-of-tree module taints kernel. [T …
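In case it matters, the blacklisting was done the standard modprobe way, roughly like this (the file name is arbitrary):

$ cat /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
$ sudo update-initramfs -u   # rebuild the initramfs so the blacklist applies at boot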

I am not sure what causes this issue or where to look for errors.

Please run nvidia-bug-report.sh as root and attach the resulting file here.

Here are the logs for both drivers 455 and 460:

nvidia-bug-report455.ppa.after_reboot.log.gz (302.6 KB) nvidia-bug-report460.ppa.after_reboot.log.gz (288.0 KB)

→ Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up. (Answer: Yes)

At some point you installed the driver with the .run file installer, which created an xorg.conf.
Please remove it if it’s still around (/etc/X11/xorg.conf).
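A quick way to check and clean up (standard paths; nvidia-uninstall is only present if a .run installation is still around):

$ ls -l /etc/X11/xorg.conf     # check whether a leftover config file exists
$ sudo rm /etc/X11/xorg.conf   # remove it if present
$ sudo nvidia-uninstall        # optional: removes a previous .run installation, if any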

[ 0.896283] … node #0, CPUs: #1
[ 0.500214] __common_interrupt: 1.55 No irq handler for vector

[ 5.901976] EDAC amd64: Error: F0 not found, device 0x1650 (broken BIOS?)

Also, your kernel does not fully support your CPU; 5.10/5.11 might work better (not that I think that’s directly related to your graphics problem).

You boot with 2 displays connected (the 2nd, non-boot one being your TV); you might also try disconnecting the TV (just for debugging).
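One way to check which outputs X actually considers connected and which one is primary (assuming X is running on display :0):

$ DISPLAY=:0 xrandr | grep -w connected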

Thank you for your reply.

I forgot to check the basic things (like cable connections) before jumping to the conclusion that something was not working. Silly me.

I think the driver is working (yes, I uninstalled the .run driver before running the test). The card is connected to 2 displays; the TV was powered off and I only turned on the 2nd monitor, so I assumed the card would ignore the TV and send its output to the monitor instead. I didn’t realize the main output was going to the TV the entire time - I was seeing the purple screen because the card was treating the monitor as a secondary display. Thank you for pointing this out. After unplugging the HDMI cable from the TV, everything seems to be working. I am now trying to run some machine learning benchmark tests. Just to be sure: do you see any outstanding issues in the log files? Thanks again.

Nothing more than I already mentioned.

Looks like I ran into an issue with the card when running some benchmark code.

The card crashed the entire system, with the fans going nuts. See the video I posted below:

As you mentioned, you don’t see anything in the logs, so I am wondering if there is any test I can run in Ubuntu to see whether the card is faulty? This is a clean Ubuntu install; I only installed the driver, CUDA and PyTorch (plus, I guess, their dependency packages), nothing else.

Can you ssh into the machine after the crash and create a bug report?
Otherwise, create a new regular report, also run journalctl -b-1 > journal.txt, and attach that.
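In other words, roughly (user/hostname are placeholders):

$ ssh user@badboy3               # from another machine, after the crash
$ sudo nvidia-bug-report.sh      # writes nvidia-bug-report.log.gz in the current directory
$ journalctl -b-1 > journal.txt  # log of the previous (crashed) boot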

I really recommend installing kernel 5.10.
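One way on Ubuntu is the mainline kernel builds (only one option among several; the exact .deb file names change per build):

$ mkdir kernel-5.10 && cd kernel-5.10
$ # download the amd64 linux-headers, linux-image-unsigned and linux-modules .deb files
$ # for the desired version from https://kernel.ubuntu.com/~kernel-ppa/mainline/
$ sudo dpkg -i *.deb
$ sudo reboot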

You could try cuda-memcheck:
https://docs.nvidia.com/cuda/cuda-memcheck/index.html

I have installed 5.10 and still experience the same issue with the card when running the benchmark code. The code hangs and the display signal goes out, but I could still ssh into the system from a terminal on my Mac and create a bug report.

$ cuda-memcheck --log-file pytorch_test.log --save output.txt ./test.sh
start
benchmark start : 2021/03/18 10:24:03
Number of GPUs on current device : 1
CUDA Version : 11.1
Cudnn Version : 8005
Device Name : GeForce RTX 3090
uname_result(system='Linux', node='badboy3', release='5.10.0-051000-generic', version='#202012132330 SMP Sun Dec 13 23:33:36 UTC 2020', machine='x86_64', processor='x86_64')
                     scpufreq(current=3850.1365937499995, min=2200.0, max=3400.0)
                    cpu_count: 32
                    memory_available: 65369694208
Benchmarking Training float precision type mnasnet0_5
mnasnet0_5 model average train time : 16.601696014404297ms
Benchmarking Training float precision type mnasnet0_75

I have attached all the log files here.

nvidia-bug-report.log.gz (177.3 KB) journal.txt (514.9 KB) pytorch_test.log (4 KB)

[ 486.776778] NVRM: GPU at PCI:0000:09:00: GPU-a1309680-6633-0c45-f563-b583b17b5b57
[ 486.776781] NVRM: GPU Board Serial Number:
[ 486.776783] NVRM: Xid (PCI:0000:09:00): 79, pid=0, GPU has fallen off the bus.
[ 486.776785] NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
[ 486.776786] NVRM: GPU 0000:09:00.0: GPU is on Board .
[ 486.776799] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.

That’s the log output showing what is happening.
According to:
https://docs.nvidia.com/deploy/xid-errors/index.html

The reason can be hardware, driver, or overheating.
Please check your cooling. If you can rule that out, there is a crash dump in the bug report that the NVIDIA people could analyze.
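If you want to watch the card while the benchmark runs, logging temperature, power and clocks from a second terminal is an easy way to rule overheating in or out, e.g.:

$ nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm --format=csv -l 1 > gpu_monitor.csv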
Beyond that, as far as I can see, there’s nothing more I can do.


I’d rather guess this is a problem with the PSU; please see this thread for some ideas:
https://forums.developer.nvidia.com/t/pc-restarting-when-trainning-dl-model/168941
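One of the mitigations discussed there is capping the GPU clocks/power draw with nvidia-smi to avoid transient load spikes, roughly like this (the numbers are only examples, adjust them for your card):

$ sudo nvidia-smi -pm 1           # enable persistence mode
$ sudo nvidia-smi -lgc 210,1700   # lock the graphics clock to a range (MHz)
$ sudo nvidia-smi -pl 300         # lower the board power limit (W)
$ sudo nvidia-smi -rgc            # later: remove the clock lock again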

Yes. Setting the GPU clock limits let me pass the benchmark code. Any pointers on where to look for potential issues with my PSU?
https://www.evga.com/products/product.aspx?pn=210-GQ-1000-V1

Just want to nail down the issue here before I choose to RMA the card.

I guess there’s nothing wrong with the GPU itself; it’s just that the specific combination of your 3090 model and PSU is problematic. What kind of power connectors does that card have (12-pin/8-pin)? Is an adapter necessary to connect it to the PSU?

The card has 2x 8-pin connectors and doesn’t need any adapter.

Then maybe try using different terminals on the PSU.

Thanks for the suggestion. I tried a few different VGA terminals on the PSU, but I don’t think it made any difference - it still crashes on the benchmark code. Maybe I will look for the offending line in the code that led to the crash.
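One thing I might try to pinpoint it: running the benchmark with synchronous kernel launches, so CUDA errors surface at the offending call instead of later (assuming test.sh launches the PyTorch script directly):

$ CUDA_LAUNCH_BLOCKING=1 ./test.sh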