After installing either driver 455 or 460, booting into Ubuntu always ends at a purple screen. I can still get to a terminal with Ctrl+Alt+F2. I have to run apt purge nvidia* to get Ubuntu to boot successfully.
$ dkms status
nvidia, 455.45.01, 5.8.0-45-generic, x86_64: installed
$ uname -a
Linux badboy3 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
(Currently on 455, but the following output was captured with driver 460 installed earlier)
$ nvidia-smi
Tue Mar 16 20:41:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:09:00.0  On |                  N/A |
|  0%   43C    P8    22W / 370W |     77MiB / 24259MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1202      G   /usr/lib/xorg/Xorg                 61MiB |
|    0   N/A  N/A      1469      G   /usr/bin/gnome-shell               14MiB |
+-----------------------------------------------------------------------------+
→ Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up. (Answer: Yes)
You once installed the driver with the .run file installer. You created an xorg.conf.
Please remove it, if it’s still around (/etc/X11/xorg.conf).
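Rather than deleting the file outright, it can be renamed so it is easy to restore. A minimal sketch, assuming the only leftover is the /etc/X11/xorg.conf named above; the helper name set_aside_xorg_conf and the .bak suffix are made up for illustration:

```python
from pathlib import Path
import shutil


def set_aside_xorg_conf(conf_path="/etc/X11/xorg.conf", suffix=".bak"):
    """Rename conf_path to conf_path + suffix if it exists.

    Returns the new Path, or None when there was nothing to move.
    The function name and the ".bak" suffix are illustrative choices.
    """
    src = Path(conf_path)
    if not src.is_file():
        return None  # no leftover config, nothing to do
    dst = src.with_name(src.name + suffix)
    shutil.move(str(src), str(dst))  # keep a copy instead of deleting
    return dst
```

On a real system this needs root, for example by calling set_aside_xorg_conf() from a script started with sudo, followed by a reboot or a restart of the display manager.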
[ 0.896283] … node #0, CPUs: #1
[ 0.500214] __common_interrupt: 1.55 No irq handler for vector
…
[ 5.901976] EDAC amd64: Error: F0 not found, device 0x1650 (broken BIOS?)
…
Also your kernel does not fully support your cpu. 5.10/5.11 might work better (not that I think that’s directly related to your graphical problem).
You boot with 2 monitors connected (the 2nd, non-boot one being your TV), so you might also try disconnecting the TV (just for debugging).
I forgot to check the basic things (like cable connections) before jumping to the conclusion that something was not working. Silly me.
I think the driver is working (yes, I uninstalled the .run driver before running the test). The card is connected to 2 displays. The TV was powered off and I only turned on the 2nd monitor; I thought the card would ignore the TV and output the signal to the monitor instead. I didn’t realize the main output was going to the TV the entire time. I was seeing a purple screen because the card was treating the monitor as a secondary display. Thank you for pointing this out. After I unplugged the HDMI cable from the TV, everything seems to be working. I am trying to run some machine learning benchmark tests. Just to make sure: do you see any outstanding issues in the log files? Thanks again.
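For anyone hitting the same thing: which outputs the X server considers connected, and which one is primary, can be read from the output of xrandr --query. A powered-off TV may still be reported as connected. A minimal sketch that parses that text, assuming an Xorg session; the function name is made up for illustration:

```python
import re

# Matches output lines such as "HDMI-0 connected primary 3840x2160+0+0 ..."
# or "DP-0 disconnected (normal left inverted ...)".
_OUTPUT_RE = re.compile(r"^(\S+) (connected|disconnected)( primary)?")


def parse_xrandr_outputs(text):
    """Return one dict per output found in `xrandr --query` text."""
    outputs = []
    for line in text.splitlines():
        m = _OUTPUT_RE.match(line)
        if m:
            outputs.append({
                "name": m.group(1),
                "connected": m.group(2) == "connected",
                "primary": m.group(3) is not None,
            })
    return outputs
```

The text can be obtained with subprocess.run(["xrandr", "--query"], capture_output=True, text=True).stdout. If the entry flagged primary is the TV, that matches the situation described here.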
Looks like I ran into an issue with the card when running some benchmark code.
The card crashed the entire system, with the fans going nuts. See the video I posted below:
As you mentioned, you don’t see anything in the logs, so I am wondering if there is any test I can run in Ubuntu to see if the card is faulty. This is a clean Ubuntu install. I just installed the driver, CUDA and PyTorch (and I guess the dependency packages) but nothing else.
Can you ssh into the machine after the crash and create a bug report?
Otherwise create a new regular report and also run journalctl -b-1 > journal.txt and attach that.
I have installed 5.10 and still experience the same issue with the card when running the benchmark code. The code hangs and the display signal goes out, but I could still ssh into the system from another terminal tab on my Mac and create a bug report.
$ cuda-memcheck --log-file pytorch_test.log --save output.txt ./test.sh
start
benchmark start : 2021/03/18 10:24:03
Number of GPUs on current device : 1
CUDA Version : 11.1
Cudnn Version : 8005
Device Name : GeForce RTX 3090
uname_result(system='Linux', node='badboy3', release='5.10.0-051000-generic', version='#202012132330 SMP Sun Dec 13 23:33:36 UTC 2020', machine='x86_64', processor='x86_64')
scpufreq(current=3850.1365937499995, min=2200.0, max=3400.0)
cpu_count: 32
memory_available: 65369694208
Benchmarking Training float precision type mnasnet0_5
mnasnet0_5 model average train time : 16.601696014404297ms
Benchmarking Training float precision type mnasnet0_75
[ 486.776778] NVRM: GPU at PCI:0000:09:00: GPU-a1309680-6633-0c45-f563-b583b17b5b57
[ 486.776781] NVRM: GPU Board Serial Number:
[ 486.776783] NVRM: Xid (PCI:0000:09:00): 79, pid=0, GPU has fallen off the bus.
[ 486.776785] NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
[ 486.776786] NVRM: GPU 0000:09:00.0: GPU is on Board .
[ 486.776799] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
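When going through a saved journal.txt or dmesg dump, the Xid lines are the ones to look for. A minimal sketch that pulls them out, assuming the line format shown in the excerpt above; the function name is made up for illustration:

```python
import re

# Matches e.g. "NVRM: Xid (PCI:0000:09:00): 79, pid=0, GPU has fallen off the bus."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+), (.*)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code, message) tuples."""
    events = []
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append((m.group(1), int(m.group(2)), m.group(3).strip()))
    return events
```

Run against the excerpt above, this yields a single event with code 79 for PCI address 0000:09:00.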
The reason can be hardware, driver, or overheating.
Please check your cooling. If you can rule that out, there is a crash-dump in the bug report, that nvidia people could analyze.
As far as I see, nothing more I can do.
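One way to check cooling and power draw is to log them over ssh while the benchmark runs, so the last readings before the crash are kept. A minimal sketch, assuming the log is produced with nvidia-smi --query-gpu=temperature.gpu,power.draw,fan.speed --format=csv,noheader,nounits -l 1 redirected to a file; the function names and dictionary keys are made up for illustration:

```python
import csv
import io

# Field order must match the --query-gpu list used to produce the log.
QUERY_FIELDS = ["temperature.gpu", "power.draw", "fan.speed"]


def parse_gpu_samples(csv_text):
    """Parse noheader/nounits CSV rows into dicts of floats."""
    samples = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != len(QUERY_FIELDS):
            continue  # skip blank or truncated lines
        try:
            temp_c, power_w, fan_pct = (float(v.strip()) for v in row)
        except ValueError:
            continue  # skip rows with non-numeric values such as "[N/A]"
        samples.append({"temp_c": temp_c, "power_w": power_w, "fan_pct": fan_pct})
    return samples


def hottest(samples):
    """Return the sample with the highest temperature."""
    return max(samples, key=lambda s: s["temp_c"])
```

If the last samples before the crash show moderate temperatures, overheating becomes a less likely explanation, and power delivery or the hardware itself is the next thing to look at.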
I guess there’s nothing wrong with the gpu card, just the specific combination of your 3090 model and psu is problematic. What kind of power connectors does that card have (12/8-Pin?). Is an adapter necessary to connect it to the psu?
Thanks for the suggestion. I tried a few of the VGA terminals on the PSU, but I don’t think it made any difference; the benchmark code still crashes. Maybe I will look for the offending line in the code that leads to the crash.