Unable to stress NVIDIA RTX A5000 on Ubuntu

In short: I can't stress the GPU on Ubuntu, but it works fine on Windows 11.

Long story:

I have an RTX A5000 24GB with a 230W TDP. I am unable to stress the GPU on Ubuntu 20.04.5 LTS. Whenever I run a machine learning model (using 4-5GB of GPU memory), the screen hangs for 4-5 seconds and then the PC restarts. BTW, my workstation has an MSI BIOS (UEFI) and no integrated graphics.

To rule out the hardware, I installed Windows 11 on the same SSD (dual-boot) and stressed the GPU to 100% TDP with the FurMark v1.31.0.0 GPU stress test benchmark tool. There was no problem. Here are the results.

I also ran multiple ML models that take up to 20GB of GPU memory. There was no problem.

How is that happening?

Here is the output of nvidia-smi on both Ubuntu 20.04 and Windows 11 Home 21H2 (Ubuntu first, then Windows).

Sat Oct  1 13:20:55 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    Off  | 00000000:65:00.0  On |                    0 |
| 30%   32C    P8    14W / 230W |    198MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1459      G   /usr/lib/xorg/Xorg                 39MiB |
|    0   N/A  N/A      2458      G   /usr/lib/xorg/Xorg                 55MiB |
|    0   N/A  N/A      2602      G   /usr/bin/gnome-shell               92MiB |
+-----------------------------------------------------------------------------+
Sat Oct  1 12:26:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 517.40       Driver Version: 517.40       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000   WDDM  | 00000000:65:00.0  On |                    0 |
| 30%   42C    P2    64W / 230W |    146MiB / 23028MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      7780    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A      8732    C+G   ...artMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A      8792    C+G   ...n1h2txyewy\SearchHost.exe    N/A      |
|    0   N/A  N/A     11088    C+G   ...r\MSI_Network_Manager.exe    N/A      |
|    0   N/A  N/A     11092    C+G   ...SI\Fast Boot\FastBoot.exe    N/A      |
+-----------------------------------------------------------------------------+

The driver version is the latest on both OSes, and the CUDA version is the same. Is there any debugging tool from NVIDIA that generates logs which I can post here?

I know there are some Windows 8/10-specific settings in the BIOS, but I don't think they are what makes this work on Windows 11 and fail on Ubuntu.

The NVIDIA Linux driver's boost behaviour differs from the Windows driver's, so it is more likely to run into PSU issues (the mainboard then resets). Please try limiting the clocks using nvidia-smi -lgc (--lock-gpu-clocks).
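
For anyone who wants to script that suggestion, here is a minimal sketch of how the clock lock could be applied and undone from Python. It only wraps the nvidia-smi options -lgc (lock GPU clocks to a min,max range in MHz), -rgc (reset them) and -i (select the GPU). The helper names and the 210/1400 MHz values are assumptions made up for illustration, not recommended settings for the A5000; the clocks your card supports can be listed with nvidia-smi -q -d SUPPORTED_CLOCKS. Changing clocks normally needs root, so the sketch defaults to a dry run that only prints the command.

```python
import shutil
import subprocess


def lock_clocks_cmd(min_mhz, max_mhz, gpu_index=0):
    """Build the nvidia-smi command that locks GPU clocks to [min_mhz, max_mhz]."""
    if not (0 < min_mhz <= max_mhz):
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]


def reset_clocks_cmd(gpu_index=0):
    """Build the nvidia-smi command that removes a previous clock lock."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]


def run(cmd, dry_run=True):
    """Print the command in dry-run mode (or if nvidia-smi is missing), else execute it."""
    if dry_run or shutil.which("nvidia-smi") is None:
        print("would run:", " ".join(cmd))
        return None
    # Needs root privileges; run the script with sudo when dry_run=False.
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder values (assumption): cap the boost clock at 1400 MHz on GPU 0.
    run(lock_clocks_cmd(210, 1400))
```

If the restarts stop with a lower maximum clock, that would point towards the PSU being unable to handle the power spikes during boost; run(reset_clocks_cmd()) builds the command that restores the default behaviour.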
