System crash after "nvidia-smi" command. RHEL 7.6 / A100 GPU

Hello,

I have 2 new A100s in my server, which is running RHEL 7.6.
After about an hour of uptime, every time I try to use nvidia-smi my system reboots.

When I'm lucky, just after a reboot, I get the output below, and I don't know why GPU 1 shows 15% utilization, and sometimes 100%, for no reason.


nvidia-smi
Wed Jun  9 15:57:55 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:02:00.0 Off |                    0 |
| N/A   51C    P0    42W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-PCIE-40GB      Off  | 00000000:82:00.0 Off |                    0 |
| N/A   57C    P0    51W / 250W |      0MiB / 40536MiB |     15%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here are the NVIDIA bug report log file and the logs of the crash:
nvidia-bug-report.log (2.7 MB)

Crash log.txt (48.6 KB)

Thanks for your help; I'll be around to add more information.

Regards,

I have the same problem. I have a new A100 80GB PCIe in my server.
I have tried the NVIDIA-Linux-x86_64-470.57.02.run and NVIDIA-Linux-x86_64-470.82.01.run drivers as well as the .deb file.
But each time I run nvidia-smi, I get a kernel crash like this:


My OS is Ubuntu 20.04.3 LTS (codename focal). Can anyone help?

@adrien.paris the logs show "GPU has fallen off the bus" messages; combining this with the server rebooting (which only the mainboard initiates) points to insufficient power. Please check your PSU. Furthermore, nvidia-persistenced needs to be configured to start on boot, otherwise the GPUs will misbehave (GPU usage without processes, stuck in P0, crashes).
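A minimal way to set that up, assuming your driver installation shipped the usual nvidia-persistenced systemd unit (driver packages normally do):

sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced
systemctl status nvidia-persistenced   # should report "active (running)"

With persistence mode active the GPUs stay initialized between clients, which should clear the phantom utilization and the GPUs being stuck in P0.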

@Phoenix007 please provide a dmesg output from a clean boot.
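Something like this right after booting, before touching the GPUs, is enough:

sudo dmesg > dmesg.log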

dmesg.log (157.5 KB)
Here is the log, @generix.

[ 2.135828] pci 0000:ca:00.0: BAR 8: no space for [mem size 0x1400000000 64bit pref]
[ 2.135830] pci 0000:ca:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
[ 2.135832] pci 0000:ca:00.0: BAR 10: no space for [mem size 0x28000000 64bit pref]
[ 2.135834] pci 0000:ca:00.0: BAR 10: failed to assign [mem size 0x28000000 64bit pref]
[ 2.135836] pci 0000:ca:00.0: BAR 7: no space for [mem size 0x00500000]
[ 2.135838] pci 0000:ca:00.0: BAR 7: failed to assign [mem size 0x00500000]

Please enable "above 4G decoding" / "large/64-bit BAR" in the BIOS.
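Once enabled, you can check from Linux whether the large BARs actually got assigned, for example (vendor ID 10de selects all NVIDIA devices):

sudo lspci -d 10de: -vv | grep -i region

The 64-bit prefetchable regions should show real addresses; if lspci reports them as unassigned or ignored, the mapping still failed.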


@generix it was already enabled, but it still crashes.

Please set kernel parameter
pci=realloc
and attach an updated dmesg output with it set.
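In case it's useful, a typical way to set a kernel parameter persistently on Ubuntu, assuming the default GRUB setup:

# in /etc/default/grub, extend the existing line, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"

sudo update-grub
sudo reboot
cat /proc/cmdline   # verify the parameter is actually active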

I get a crash quite similar to @Phoenix007's.
After setting pci=realloc, not much has changed.
dmesg.log (141.5 KB)

It seems the board/BIOS is incompatible with the A100. Please check for a BIOS update. Which model is that mainboard/server?

It's a Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/2021
NVIDIA A100 80GB PCIe

It's the latest BIOS provided by Dell. I also tried kernel 5.11.0-27, which is mentioned in the compatibility list of the CUDA toolkit. The error seems to be the same.

It's even certified by NVIDIA to work with an A100 80GB:
https://www.nvidia.com/en-us/data-center/data-center-gpus/qualified-system-catalog/
Please post/attach the output of
sudo lspci -t

Checking the BIOS manual, it also has an option to set the MM I/O Base. Please check if changing that makes it work.

Yes, I've checked that setting. There are only 3 options: 56TB (default), 12TB, and 512GB.
After choosing the last one, the board doesn't detect the GPU. I get this message in the iDRAC log:

The data communication with the device video.slot.7-1 (GPU Controller in Slot 7 of Instance 1) is lost.

Options 56TB and 12TB work the same way: the board shows the GPU but the NVIDIA driver crashes.

dmesg.log (137.4 KB)
dmesg_MM12TB.log (135.4 KB)
dmesg_MM512GB.log (119.4 KB)
lspci.log (5.4 KB)

All other options for Integrated Devices:

Setting the kernel parameter pci=realloc=off solves the problem.
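For anyone landing here, the same GRUB recipe as above applies, just with realloc disabled, plus a quick sanity check that the BARs now assign cleanly:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc=off"   # in /etc/default/grub, then update-grub and reboot
cat /proc/cmdline | grep realloc
sudo dmesg | grep -i "failed to assign"   # should print nothing anymore
nvidia-smi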

I’ve received the solution on Dell’s forum: