We have an Ubuntu 18.04 LTS system running on a Lenovo P920 workstation. The machine has an MSI RTX 3090 and an RTX 2080, a dual-CPU setup, and a 1400 W PSU. Full system specs are below.
The issue we’re having is that whenever we put the MSI 3090 under any meaningful load, the system freezes and then hard-crashes, faster than any logs can be written to indicate why. It behaves like a power issue (which could certainly be the case). However, when we install a Zotac 3090 instead, the system handles the same load without problems.
My best guess is that the GPU briefly spikes its power request beyond what the PSU will deliver, which triggers a safety shutoff. We’ve reinstalled the NVIDIA drivers, reinstalled CUDA, reinstalled Docker, and plenty else, but the problem persists. I’m hoping someone can review my bug-report log and advise on what (if anything) is wrong with this system, or suggest next steps for troubleshooting further and isolating the problem.
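Since the box dies before anything hits disk, one thing worth checking is the kernel journal from the boot that crashed. This is a generic sketch, not something from your bug report, and it assumes systemd with persistent journaling enabled:

```shell
# Dump the tail of the kernel log from the previous boot (-b -1).
# Assumes persistent journaling; if /var/log/journal doesn't exist,
# create it (sudo mkdir -p /var/log/journal), reboot, and reproduce
# the crash first so there is a previous boot to read.
{
  journalctl -k -b -1 --no-pager 2>/dev/null \
    || echo "no previous-boot journal available (persistent journaling may be off)"
} | tail -n 50 | tee /tmp/prevboot_tail.log
```

If the PSU trips over-current protection, you'll often see nothing at all here, but a clean-looking tail at least rules out a kernel panic or driver oops as the trigger.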
We’re able to run `stress` with all cores maxed out with no issue, even with the problem GPU installed. With just the RTX 2080 installed, or the 2080 plus the Zotac card, a full GPU stress test using LuxMark at maximum load also runs fine. That same LuxMark strain test causes the crash when the MSI 3090 is installed.
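For reference, the CPU-side load was along these lines (a sketch; it assumes the stock `stress` package from the Ubuntu repos, and the exact LuxMark invocation is omitted):

```shell
# Max out all logical cores for a short fixed window; skips cleanly
# if the `stress` package (sudo apt install stress) isn't present.
{
  if command -v stress >/dev/null 2>&1; then
    stress --cpu "$(nproc)" --timeout 5 2>&1
  else
    echo "stress not installed; skipping CPU load"
  fi
} | tee /tmp/stress_run.log
```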
Thanks very much in advance for all your time/help/consideration. I’m starting to think that maybe the MSI 3090 simply isn’t compatible with the Linux drivers.
Our next step is a full OS restore to 20.04 LTS and a clean reinstall, to absolutely 100% validate that things are configured the way they should be; failing that, I suppose we’ll figure out how to sell/return this MSI card and buy another Zotac.
(In case you’re curious, I’ve also tried capping the power draw to something more reasonable, below 300 W:

# Limit power usage
sudo nvidia-smi --persistence-mode=1
sudo nvidia-smi --power-limit=280
This did not help; the crash occurs during GPU spin-up when it’s only drawing ~100 W (idling at 13 W is fine).
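One sanity check on the cap (an assumption on my part, not something from your report): confirm the driver actually enforced the 280 W limit, since on some boards the requested value gets clamped back toward the default:

```shell
# Query the enforced, default, and maximum power limits for GPU 0;
# prints a notice on machines without the NVIDIA driver so the
# snippet degrades gracefully.
{
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -i 0 \
      --query-gpu=power.limit,power.default_limit,power.max_limit \
      --format=csv
  else
    echo "nvidia-smi not available on this machine"
  fi
} | tee /tmp/power_caps.csv
```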
(I watched a live feed of the wattage with the following:)
$ nvidia-smi --loop-ms=20 --format=csv,noheader,nounits --query-gpu=power.draw > out.txt
$ tail -f out.txt
At most it hits roughly 100 W before the system terminates. It never registers or logs a spike above that before the crash, so either the spike happens in the tick before it would be logged, or a spike isn’t the issue.
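For what it’s worth, the peak reading can be pulled straight out of the capture file; the sample values below are stand-ins, not your real log:

```shell
# Extract the highest wattage sample from the capture.
# Stand-in data for illustration; point `sort` at the real out.txt instead.
printf '13.02\n45.71\n98.40\n101.33\n97.85\n' > /tmp/power_sample.txt
sort -g /tmp/power_sample.txt | tail -n 1 | tee /tmp/peak_watts.txt
```

`sort -g` sorts by general numeric value, so the last line is the peak; note that at a 20 ms polling interval a sub-millisecond transient spike could still slip between samples.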
I also ruled out heat by monitoring the temperatures reported by the GPUs before the system powered off. Neither GPU even had time to get hot from spin-up before the crash; it happens close to immediately, so the cards are still basically at idle temps, maybe 33-40 °C, before dying.
nvidia-bug-report.log.gz (628.2 KB)
System report/details:
2x Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
192GB RAM (32GBx6)
1TB internal OS SSD + 2TB SATA drive onboard storage
nvidia driver version: 460.32.03 CUDA Version: 11.2
(nvidia drivers installed via:
sudo apt-get install ubuntu-drivers-common \
  && sudo ubuntu-drivers autoinstall)

$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"

$ nvidia-smi
Tue Apr 13 15:19:51 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:3B:00.0 Off |                  N/A |
|  0%   33C    P8    13W / 370W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208…    Off  | 00000000:AF:00.0 Off |                  N/A |
| 27%   28C    P8    14W / 250W |     86MiB / 11017MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2083      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2083      G   /usr/lib/xorg/Xorg                 27MiB |
|    1   N/A  N/A      2479      G   /usr/bin/gnome-shell               56MiB |
+-----------------------------------------------------------------------------+

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping: 7
CPU MHz: 1000.481
CPU max MHz: 3200.0000
CPU min MHz: 1000.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush d
ts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs
bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx
est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer ae
s xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_p
pin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_a
djust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb
intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md
_clear flush_l1d arch_capabilities
Please let me know if there is additional data I can provide that might be helpful, or if you have any thoughts on what to try next besides seeing how far I can punt the card.
Thanks!