MSI 3090 GPU causes a full system crash when under any sort of load -- Ubuntu 18.04LTS

We have an Ubuntu 18.04LTS platform running on a Lenovo P920 workstation. The system has an MSI 3090 GPU and a 2080RTX card, a dual CPU setup, and a 1400W PSU. System specs to follow.

The issue we’re having is that whenever we initialize the 3090 GPU with any meaningful tasking, the system will freeze up and then crash hard - faster than any logs get generated indicating why. It’s reacting as if it’s a power issue (which could certainly be the case). However, when we install a Zotac 3090 GPU, the system can handle the load without problem.

My best guess is that the GPU spikes a request to the PSU in a high wattage range, which then causes the system to safety-shutoff. We’ve reinstalled the NVIDIA drivers, CUDA, Docker, and a bunch else, but the problem persists. I’m hoping someone might be able to review my bug-report log and advise on what (if anything) might be up with this system, or provide some potential next steps for troubleshooting further and isolating what the problem is.

We’re able to run stress with all cores maxed out with no issue, even with the problem GPU installed. When just the 2080RTX card is installed, or the 2080 plus the Zotac card, a full stress test using luxmark for bench/maximum load also runs fine. This same GPU strain test causes the crash with the MSI 3090 GPU installed.

Thanks very much in advance for all your time/help/consideration. Starting to think that maybe the MSI 3090 simply isn’t compatible with the linux drivers.

Our next step is going to be a full OS restore to 20.04LTS and reinstall clean to absolutely 100% validate that things are working the way they should be, or I suppose we’ll figure out how to sell/return this MSI card and buy another Zotac.

In case you’re curious, I’ve also tried limiting the power draw to something more reasonable, under 300W:
# Limit power usage
sudo nvidia-smi --persistence-mode=1
sudo nvidia-smi --power-limit=280

This did not work: the crash occurs during the spin-up of the GPU, when it’s only pushing 100W (idle at 13W is fine).
I watched the power draw with the following commands to pull a live feed of the wattage:
nvidia-smi --loop-ms=20 --format=csv,noheader,nounits --query-gpu=power.draw > out.txt &
tail -f out.txt

At most it’ll hit 100W (give or take) before the system terminates. It doesn’t register or log a spike above that before it crashes, so either it’s happening in the tick before it logs or the spike isn’t the issue.
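For anyone repeating this, here is a minimal sketch for summarizing the logged samples after a crash. It assumes out.txt holds one wattage reading per line, as the logging command above writes it. The function name summarize_power_log is made up for illustration. Keep in mind that the last samples before a hard power-off may never reach the disk, so the peak it reports is only a lower bound.

```shell
# Summarize a power log with one wattage value per line (hypothetical helper).
# Reads the files given as arguments, or stdin when none are given.
summarize_power_log() {
    awk '
        NF {                                  # skip blank lines
            v = $1 + 0                        # force numeric interpretation
            if (n == 0 || v > max) max = v    # track the highest reading
            last = v                          # remember the final reading
            n++
        }
        END {
            if (n) {
                printf "samples=%d peak=%.2f last=%.2f\n", n, max, last
            } else {
                print "samples=0"
            }
        }
    ' "$@"
}

# Usage after a reboot: summarize_power_log out.txt
```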

I also validated heat, monitoring the temp reported by the GPUs before the system powered off. Neither GPU even had time to get hot from spin-up before we crashed. (It’s close to immediate that this issue occurs, so we’re still basically at idle temps, maybe 33C – 40C, before dying.)

Thanks in advance for your time and consideration.


nvidia-bug-report.log.gz (628.2 KB)

System report/details:

2x Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
192GB RAM (32GBx6)
1TB internal OS SSD + 2TB SATA drive onboard storage
nvidia driver version: 460.32.03 CUDA Version: 11.2
NVIDIA drivers installed with:
sudo apt-get install ubuntu-drivers-common && sudo ubuntu-drivers autoinstall

$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"

$ nvidia-smi
Tue Apr 13 15:19:51 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:3B:00.0 Off |                  N/A |
|  0%   33C    P8    13W / 370W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 27%   28C    P8    14W / 250W |     86MiB / 11017MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2083      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2083      G   /usr/lib/xorg/Xorg                 27MiB |
|    1   N/A  N/A      2479      G   /usr/bin/gnome-shell               56MiB |
+-----------------------------------------------------------------------------+

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping: 7
CPU MHz: 1000.481
CPU max MHz: 3200.0000
CPU min MHz: 1000.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Please let me know if there is additional data I can provide that might be helpful, or if you have any thoughts on what to try next besides seeing how far I can punt the card.

Thanks!

To prevent power spikes from GPU boost, it’s more efficient to limit clocks, e.g.
nvidia-smi -lgc 300,1800
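A minimal sketch of how that could be wrapped so the clocks are always reset after a test run. -lgc (lock GPU clocks) and -rgc (reset GPU clocks) are nvidia-smi options. The with_clock_cap function and the NVSMI variable are made up for this example, and GPU index 0 is assumed to be the 3090, as in the nvidia-smi output posted above.

```shell
# NVSMI is a hypothetical override hook: set NVSMI=echo to preview the
# commands without touching a GPU. Changing clocks normally needs root.
NVSMI=${NVSMI:-"sudo nvidia-smi"}

# with_clock_cap GPU_INDEX MIN_MHZ MAX_MHZ COMMAND [ARGS...]
# Locks the GPU clock range, runs COMMAND, then resets the clocks.
with_clock_cap() {
    wcc_gpu=$1
    wcc_min=$2
    wcc_max=$3
    shift 3

    # Lock the clocks; give up if the driver rejects the request.
    $NVSMI -i "$wcc_gpu" -lgc "$wcc_min,$wcc_max" || return 1

    # Run the workload and remember its exit status.
    wcc_status=0
    "$@" || wcc_status=$?

    # Always restore default clock behaviour afterwards.
    $NVSMI -i "$wcc_gpu" -rgc

    return "$wcc_status"
}

# Example (the script name is a placeholder for whatever load test you use):
#   with_clock_cap 0 300 1800 ./run_stress_test.sh
```

If the system still dies with the clocks capped this low, boost spikes are probably not the whole story.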

What psu brand/model is built into the system?

Hey Generix! Thanks for taking a look, I appreciate it!

The system is a standard Lenovo P920 workstation, so it’s their form-factor model that comes built into the unit at 1400W. Here’s a document on the full details - page 9 seems to indicate the power load supported:

The system does indicate it has a restricted line in mode, but we’re powering the system off a standard PDU rail in a full-load capacity server room with backup power, so input power is not a problem. (Though if you’ve got a suggested tool for checking what the draw is I’ll validate).

I’ll try that clock limiting today and see if that works. I reached out to MSI yesterday as well and of course their response was a very helpful “if you want it to work, install it on windows, or ask Nvidia, they make the drivers”. (hah)

I’m willing to load up a local instance of Windows 10 just to see if the Windows drivers resolve the error and to validate the hardware configuration. The main differences I can tell between the Zotac card and this MSI card are the 3rd 6-pin input on the board (compared to 2 on the Zotac) and a higher wattage drain at max load on the MSI than on the Zotac, by about 40W.

Thanks for the assist, I’ll let you know how the clock limits go, and if we decide to set up a fresh installation or not. (I’ll include a new bug-report.log.gz if we do that + updated build details).

The correct entity to talk to in these kinds of cases is actually the vendor of the PSU, so in this case Lenovo: ask whether their PSU supports that kind of 3090 with 3x 6-pin inputs. The power specs you provided are quite detailed, though also a bit confusing. Did you connect the extra mainboard GPU power connector to the PSU?
Sounds like the MSI might be an OC model with extra power requirements.

That makes sense, I’ll reach out to them this AM and see if I can’t get a straight answer from their support team about it - I know when we ordered the P920’s the engineer group wasn’t 100% on whether or not they’d run so it’s entirely plausible that they’re simply outclassing the PSU’s supported outputs.

We do have all three connectors hooked up, though we are using a splitter assembly, so the bottom two 6+2 rail pins are supporting all 18 inputs. This may be part of the problem, if the 2 cables aren’t enough to support the card fully. This morning I may see if we can use a splitter to pull from the top bay rail (removing the 2080 from the mix), if it’ll stretch to the card, and check whether the load is distributed better in that configuration.

I’ll get back to you when I’ve heard more from the vendor - it may be that the bottom rail by itself can’t meet the full wattage output. Based on that link I supplied, there’s a support wattage cap that indicates the following:

GPU Support: (250W x 3) OR (235W x 3) OR (180W x 3) OR (150W x 3) OR (140W x 4) OR (120W x 4)

So while the total Wattage pull from the GPUs is less than that ‘max’ 250W x3 it’s maybe more than the two pin outs can supply without pulling from a second rail…
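A rough back-of-the-envelope check of that guess, as a sketch only. The 75W slot and 150W per 6+2 (8-pin) cable figures are the nominal PCIe ratings, assumed here rather than taken from Lenovo’s document. The 370W is the card’s power cap from the nvidia-smi output above. The headroom function name is hypothetical.

```shell
# Nominal PCIe power ratings in watts.
# ASSUMPTION: generic PCIe figures, not confirmed for the P920's cabling.
SLOT_W=75         # power available through the x16 slot itself
EIGHT_PIN_W=150   # power per 6+2 (8-pin) auxiliary cable

# headroom CARD_LIMIT_W CABLE_COUNT
# Prints nominal supply (slot + cables) minus the card's power limit.
headroom() {
    hr_limit=$1
    hr_cables=$2
    hr_supply=$(( SLOT_W + hr_cables * EIGHT_PIN_W ))
    echo $(( hr_supply - hr_limit ))
}

headroom 370 2   # two cables feeding the card through a splitter
headroom 370 3   # a third cable added
```

On these assumed numbers, two cables leave only about 5W of nominal margin under the 370W cap, before any transient spikes, while a third cable leaves about 155W.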

Let’s see what Lenovo says.

Okay, after some splitter power cable rearranging, we have determined that: The card requires power input from the top rail. The bottom two rail pins are simply not powerful enough to handle the card’s draw, but when we linked the top pins in, it works great.

This means that we can run the 3090 MSI card, but sacrifice having a GPU in the top bay. (Or we’ll need to downgrade it to a 1080 or something with lesser power requirements and find the balance there.)

Fascinating. Thanks very much for helping me bounce ideas back and forth. I’m going to mark your answer as the correct solution, as it was indeed a power draw issue. Now that we are beginning to isolate exactly where that line is, at a minimum we can consider this closed with a power boost workaround. Presumably we could look into external PSUs, but I get real nervous with those bridge cards; it might be more efficient to simply buy a beefier system to handle the 3090s and up that we end up buying moving forward.

Thanks for your help today!