Driver 455/460 loads in Ubuntu but boot sequence hangs with purple screen

Build specs:

  • ryzen 5950x
  • RTX 3090 (gigabyte master)
  • 64GB ram
  • EVGA 1000w psu
  • ubuntu 20.04

After installing either driver 455 or 460, booting into Ubuntu always ends with a purple screen. I can still reach a terminal with Ctrl+Alt+F2, but I have to run apt purge nvidia* to get Ubuntu to boot successfully.

$ dkms status
nvidia, 455.45.01, 5.8.0-45-generic, x86_64: installed

$ uname -a
Linux badboy3 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

(Currently on 455, but the following output is from driver 460, which was installed earlier)

$ nvidia-smi
Tue Mar 16 20:41:48 2021
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 3090    Off  | 00000000:09:00.0  On |                  N/A |
|  0%   43C    P8    22W / 370W |     77MiB / 24259MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A      1202      G   /usr/lib/xorg/Xorg                 61MiB |
|    0   N/A  N/A      1469      G   /usr/bin/gnome-shell               14MiB |

I have blocked the nouveau driver. Here is the output from Xorg.0.log: [ 9.950] (--) Log file renamed from "/var/log/" to "/var/lo -
Output from dmesg: [Tue Mar 16 20:41:17 2021] nvidia: loading out-of-tree module taints kernel.[T -

I am not sure what is causing this issue or where to look for errors.

Please run as root and attach the resulting file here.

Here are the logs, for both driver 455 and 460

nvidia-bug-report455.ppa.after_reboot.log.gz (302.6 KB) nvidia-bug-report460.ppa.after_reboot.log.gz (288.0 KB)

→ Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up. (Answer: Yes)

You once installed the driver with the .run file installer. You created an xorg.conf.
Please remove it, if it’s still around (/etc/X11/xorg.conf).
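A minimal sketch of setting the file aside instead of deleting it outright, so it can be restored if needed. The function name and the timestamped backup suffix are made up for illustration; on a real system it has to be run as root.

```python
import shutil
import time
from pathlib import Path


def set_aside_xorg_conf(conf_path="/etc/X11/xorg.conf"):
    """Rename an existing xorg.conf so X falls back to auto-configuration.

    Returns the backup path, or None if there was no file to move.
    """
    conf = Path(conf_path)
    if not conf.is_file():
        return None
    # Hypothetical naming scheme: a timestamp keeps older backups intact.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = conf.with_name(conf.name + ".bak-" + stamp)
    shutil.move(str(conf), str(backup))
    return backup
```

Calling set_aside_xorg_conf() with no argument targets /etc/X11/xorg.conf; a None result means there was nothing left over from the .run installer.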

[ 0.896283] … node #0, CPUs: #1
[ 0.500214] __common_interrupt: 1.55 No irq handler for vector

[ 5.901976] EDAC amd64: Error: F0 not found, device 0x1650 (broken BIOS?)

Also your kernel does not fully support your cpu. 5.10/5.11 might work better (not that I think that’s directly related to your graphical problem).

You boot with two monitors connected (the second, non-boot one being your TV). You might also try disconnecting the TV, just for debugging.
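To see which outputs the driver considers connected and which one is primary, the output of xrandr --query can be parsed. This is only a rough sketch: it assumes a running X session that xrandr can reach (from a text console, DISPLAY and XAUTHORITY may need to be set), and the function names are made up for illustration.

```python
import re
import subprocess

# Matches output lines such as "HDMI-0 connected primary 1920x1080+0+0 ..."
_OUTPUT_LINE = re.compile(
    r"^(?P<name>\S+) (?P<state>connected|disconnected)(?P<primary> primary)?"
)


def parse_xrandr(text):
    """Return one dict per output listed in `xrandr --query` text."""
    outputs = []
    for line in text.splitlines():
        m = _OUTPUT_LINE.match(line)
        if m:
            outputs.append({
                "name": m["name"],
                "connected": m["state"] == "connected",
                "primary": m["primary"] is not None,
            })
    return outputs


def query_outputs():
    """Run xrandr and parse its output (needs access to the X session)."""
    result = subprocess.run(
        ["xrandr", "--query"], capture_output=True, text=True, check=True
    )
    return parse_xrandr(result.stdout)
```

If a powered-off TV still shows up as connected and primary, the desktop is probably being drawn there rather than on the monitor.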

Thank you for your reply.

I forgot to check the basic things (like cable connections) before jumping to the conclusion that something was not working. Silly me.

I think the driver is working (yes, I uninstalled the .run driver before running the test). The card is connected to 2 displays. The TV was powered off and I only turned on the 2nd monitor, thinking the card would ignore the TV and send the signal to the monitor instead. I didn’t realize the main output was going to the TV the entire time. I was seeing a purple screen because the card was treating the monitor as a secondary display. Thank you for pointing this out. After I unplugged the HDMI cable from the TV, everything seems to be working. I am trying to run some machine learning benchmark tests and just want to make sure: do you see any outstanding issues in the log files? Thanks again.

Nothing more than I already mentioned.

Looks like I ran into an issue with the card when running some benchmark code

The card crashed the entire system, with the fans going nuts. See the video I posted below:

As you mentioned, you don’t see anything in the logs, so I am wondering if there is any test I can do in Ubuntu to see if the card is faulty. This is a clean Ubuntu install. I just installed the driver, CUDA and PyTorch (and I guess the dependency packages) but nothing else.

Can you ssh into the machine after the crash and create a bug report?
Otherwise create a new regular report and also run journalctl -b-1 > journal.txt and attach that.
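Once journal.txt exists, the interesting part is usually any NVRM Xid message, which is how the NVIDIA kernel module reports GPU errors in the kernel log. A small sketch for pulling those out, assuming the usual "NVRM: Xid (PCI:...): code, detail" layout; the function name is made up for illustration.

```python
import re

# Assumed layout: "NVRM: Xid (PCI:0000:09:00): <code>, <detail>"
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<detail>.*)"
)


def find_xid_events(text):
    """Return (pci_address, xid_code, detail) for every Xid line in the text."""
    events = []
    for line in text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append((m["pci"], int(m["code"]), m["detail"].strip()))
    return events
```

For example, find_xid_events(open("journal.txt", errors="replace").read()) gives a list that can be compared against NVIDIA's Xid error table.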

I really recommend installing kernel 5.10:

You could try cuda memcheck:

I have installed 5.10 and still experience the same issue with the card when running the benchmark code. The code hangs and the display signal goes out, but I could still ssh into the system from another terminal tab on my Mac and create a bug report.

$ cuda-memcheck --log-file pytorch_test.log --save output.txt ./
benchmark start : 2021/03/18 10:24:03
Number of GPUs on current device : 1
CUDA Version : 11.1
Cudnn Version : 8005
Device Name : GeForce RTX 3090
uname_result(system='Linux', node='badboy3', release='5.10.0-051000-generic', version='#202012132330 SMP Sun Dec 13 23:33:36 UTC 2020', machine='x86_64', processor='x86_64')
                     scpufreq(current=3850.1365937499995, min=2200.0, max=3400.0)
                    cpu_count: 32
                    memory_available: 65369694208
Benchmarking Training float precision type mnasnet0_5
mnasnet0_5 model average train time : 16.601696014404297ms
Benchmarking Training float precision type mnasnet0_75

I have attached all the log files here.

nvidia-bug-report.log.gz (177.3 KB) journal.txt (514.9 KB) pytorch_test.log (4 KB)

[ 486.776778] NVRM: GPU at PCI:0000:09:00: GPU-a1309680-6633-0c45-f563-b583b17b5b57
[ 486.776781] NVRM: GPU Board Serial Number:
[ 486.776783] NVRM: Xid (PCI:0000:09:00): 79, pid=0, GPU has fallen off the bus.
[ 486.776785] NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
[ 486.776786] NVRM: GPU 0000:09:00.0: GPU is on Board .
[ 486.776799] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.

That’s the log output from what is happening.
According to:

The reason can be hardware, driver, or overheating.
Please check your cooling. If you can rule that out, there is a crash-dump in the bug report, that nvidia people could analyze.
As far as I see, nothing more I can do.
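One way to check cooling is to sample temperature, power draw, SM clock and fan speed from a second terminal (or over ssh) while the benchmark runs, and look at the last readings before the crash. A rough sketch, assuming nvidia-smi is on the PATH; the helper names are made up, and fields the GPU does not report come back as None.

```python
import subprocess

# Query fields accepted by `nvidia-smi --query-gpu=...`
FIELDS = ("temperature.gpu", "power.draw", "clocks.sm", "fan.speed")


def parse_sample(csv_line):
    """Turn one 'csv,noheader,nounits' line into a dict of floats (or None)."""
    values = [v.strip() for v in csv_line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    sample = {}
    for name, raw in zip(FIELDS, values):
        try:
            sample[name] = float(raw)
        except ValueError:
            # e.g. "[N/A]" when the value is not available
            sample[name] = None
    return sample


def read_sample(gpu_index=0):
    """Query the GPU once; needs the NVIDIA driver to be loaded."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])
```

Calling read_sample() in a loop with a short sleep and appending each result to a file leaves a record that survives the display going out. If the temperature is still moderate in the last sample before the crash, overheating becomes less likely.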

I’d rather guess this is a problem with the psu, please see this thread for some ideas:

Yes. Setting GPU clock limits lets the benchmark code run to completion. Any pointers on where to look for potential issues with my PSU?
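For reference, one common way to cap GPU clocks is nvidia-smi -lgc (--lock-gpu-clocks), with -rgc to reset. The thread does not say which method or values were actually used, so the sketch below is an assumption, and the 210 and 1700 MHz figures in the usage note are purely illustrative. The helper names are made up, and the commands need root when really executed.

```python
import subprocess


def lock_gpu_clocks_cmd(min_mhz, max_mhz, gpu_index=0):
    """Build the nvidia-smi command that locks GPU clocks to a range."""
    if not 0 < min_mhz <= max_mhz:
        raise ValueError("need 0 < min_mhz <= max_mhz")
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]


def reset_gpu_clocks_cmd(gpu_index=0):
    """Build the command that removes a previously set clock lock."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]


def run(cmd, dry_run=True):
    """Print-friendly dry run by default; execute only when asked to."""
    line = " ".join(cmd)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return line
```

For example, run(lock_gpu_clocks_cmd(210, 1700)) returns the command line without executing it. Lowering the upper bound step by step until the benchmark passes gives a rough idea of how much headroom the card and PSU combination has.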

Just want to nail down the issue here before I choose to RMA the card.

I guess there’s nothing wrong with the GPU itself; it’s just that the specific combination of your 3090 model and PSU is problematic. What kind of power connectors does that card have (12/8-pin)? Is an adapter necessary to connect it to the PSU?

The card has 2x 8-pin connectors, doesn’t need any adapter.

Then maybe try using different terminals on the psu.

Thanks for the suggestion. I tried a few of the VGA terminals on the PSU, but I don’t think it made any difference: the benchmark code still crashes. Maybe I will look for the offending line in the code that leads to the crash.