In short: I can’t put the GPU under load on Ubuntu, but it works fine on Windows 11.
Long story:
I have an RTX A5000 24GB with a 230W TDP. I am unable to stress the GPU on Ubuntu 20.04.5 LTS. Whenever I run a machine learning model (using 4-5GB of GPU memory), the screen hangs for 4-5 seconds and then the PC restarts. BTW, my workstation has an MSI BIOS (UEFI) and there is no integrated graphics.
To test this, I installed Windows 11 on the same SSD (dual-boot) and stressed the GPU to 100% TDP with the FurMark v1.31.0.0 GPU stress test benchmark tool. There was no problem. Here are the results.
I also ran multiple ML models that take up to 20GB of GPU memory. There was no problem.
How is that happening?
Here is the output of nvidia-smi on both Ubuntu 20.04 and Windows 11 Home 21H2 (Ubuntu first, then Windows).
Sat Oct 1 13:20:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 Off | 00000000:65:00.0 On | 0 |
| 30% 32C P8 14W / 230W | 198MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1459 G /usr/lib/xorg/Xorg 39MiB |
| 0 N/A N/A 2458 G /usr/lib/xorg/Xorg 55MiB |
| 0 N/A N/A 2602 G /usr/bin/gnome-shell 92MiB |
+-----------------------------------------------------------------------------+
Sat Oct 1 12:26:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 517.40 Driver Version: 517.40 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 WDDM | 00000000:65:00.0 On | 0 |
| 30% 42C P2 64W / 230W | 146MiB / 23028MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 7780 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 8732 C+G ...artMenuExperienceHost.exe N/A |
| 0 N/A N/A 8792 C+G ...n1h2txyewy\SearchHost.exe N/A |
| 0 N/A N/A 11088 C+G ...r\MSI_Network_Manager.exe N/A |
| 0 N/A N/A 11092 C+G ...SI\Fast Boot\FastBoot.exe N/A |
+-----------------------------------------------------------------------------+
The driver is the latest version on both OSes, and the CUDA version is the same (11.7). Is there a debugging tool from NVIDIA that generates logs I can post here?
I know there are some Windows 8/10-specific settings in the BIOS, but I don’t think they explain why the GPU works on Windows 11 and not on Ubuntu.
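In case it helps with the log question above, here is a minimal sketch of how the readings leading up to the restart could be captured on Ubuntu. It is not an NVIDIA tool: the script, the file name `gpu_log.csv`, the 0.5 s interval, and the function names are my own assumptions. It only polls the `--query-gpu` interface of nvidia-smi and forces every sample to disk so the last lines survive a hard reset.

```python
import csv
import os
import subprocess
import time

# Query fields for `nvidia-smi --query-gpu`; the full list is printed by
# `nvidia-smi --help-query-gpu`.
FIELDS = ["timestamp", "pstate", "power.draw",
          "temperature.gpu", "utilization.gpu", "memory.used"]


def build_command(fields=FIELDS):
    """Build the nvidia-smi command line that prints one CSV row per GPU."""
    return ["nvidia-smi",
            "--query-gpu=" + ",".join(fields),
            "--format=csv,noheader,nounits"]


def parse_sample(line, fields=FIELDS):
    """Turn one CSV row from nvidia-smi into a {field: value} dict of strings."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} values, got {len(values)}")
    return dict(zip(fields, values))


def log_forever(path="gpu_log.csv", interval_s=0.5):
    """Append samples to `path` until interrupted (hypothetical helper)."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            out = subprocess.run(build_command(), capture_output=True,
                                 text=True, check=True).stdout
            for line in out.splitlines():
                if line.strip():
                    sample = parse_sample(line)
                    writer.writerow([sample[k] for k in FIELDS])
            # Flush and fsync so the final samples are on disk if the PC resets.
            f.flush()
            os.fsync(f.fileno())
            time.sleep(interval_s)
```

The idea would be to call `log_forever()` in a second terminal, start the ML model, and after the reboot post the last few rows of `gpu_log.csv` (power draw, temperature, and performance state just before the restart).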