System crash after "nvidia-smi" command. RHEL 7.6 / A100 GPU

Hello,

I have 2 new A100s in my server, which is running RHEL 7.6.
After about an hour of uptime, every time I try to use nvidia-smi my system reboots.

When I'm lucky, just after a reboot, I get the output below, and I don't know why GPU 1 shows 15% utilization, and sometimes 100%, for no reason.


nvidia-smi
Wed Jun  9 15:57:55 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:02:00.0 Off |                    0 |
| N/A   51C    P0    42W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-PCIE-40GB      Off  | 00000000:82:00.0 Off |                    0 |
| N/A   57C    P0    51W / 250W |      0MiB / 40536MiB |     15%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here are the NVIDIA bug report log file and the logs of the crash:
nvidia-bug-report.log (2.7 MB)

Crash log.txt (48.6 KB)

Thanks for your help; I'll be around to add more information.

Regards,

I have the same problem. I have a new A100 80GB PCIe in my server.
I have tried the NVIDIA-Linux-x86_64-470.57.02.run and NVIDIA-Linux-x86_64-470.82.01.run drivers as well as the .deb file.
But each time I run nvidia-smi, I get a kernel crash like this:


My OS is Ubuntu 20.04.3 LTS (codename focal). Can anyone help?

@adrien.paris the logs show "GPU has fallen off the bus" messages; combining this with the server rebooting (which only the mainboard initiates) points to insufficient power. Please check your PSU. Furthermore, nvidia-persistenced needs to be configured to start on boot, otherwise the GPUs will misbehave (GPU usage without processes, stuck in P0, crashes).
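A minimal way to set that up, assuming your driver installation shipped the usual nvidia-persistenced systemd unit (driver packages normally do):

sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced
systemctl status nvidia-persistenced   # should report "active (running)"

With persistence mode active the GPUs stay initialized between clients, which should clear the phantom utilization and the GPUs being stuck in P0.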

@Phoenix007 please provide a dmesg output from a clean boot.
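Something like this right after booting, before touching the GPUs, is enough:

sudo dmesg > dmesg.log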

dmesg.log (157.5 KB)
Here is the log, @generix.

[ 2.135828] pci 0000:ca:00.0: BAR 8: no space for [mem size 0x1400000000 64bit pref]
[ 2.135830] pci 0000:ca:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
[ 2.135832] pci 0000:ca:00.0: BAR 10: no space for [mem size 0x28000000 64bit pref]
[ 2.135834] pci 0000:ca:00.0: BAR 10: failed to assign [mem size 0x28000000 64bit pref]
[ 2.135836] pci 0000:ca:00.0: BAR 7: no space for [mem size 0x00500000]
[ 2.135838] pci 0000:ca:00.0: BAR 7: failed to assign [mem size 0x00500000]

Please enable "above 4G decoding" / "large/64-bit BAR" in the BIOS.
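Once enabled, you can check from Linux whether the large BARs actually got assigned, for example (vendor ID 10de selects all NVIDIA devices):

sudo lspci -d 10de: -vv | grep -i region

The 64-bit prefetchable regions should show real addresses; if lspci reports them as unassigned or ignored, the mapping still failed.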


@generix it was already enabled, but it still crashes.

Please set kernel parameter
pci=realloc
and attach an updated dmesg output with it set.
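In case it's useful, a typical way to set a kernel parameter persistently on Ubuntu, assuming the default GRUB setup:

# in /etc/default/grub, extend the existing line, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"

sudo update-grub
sudo reboot
cat /proc/cmdline   # verify the parameter is actually active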

I get a crash quite similar to @Phoenix007's.
After setting pci=realloc, not much has changed.
dmesg.log (141.5 KB)

It seems the board/BIOS is incompatible with the A100. Please check for a BIOS update. Which model is that mainboard/server?

It's a Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/2021
NVIDIA A100 80GB PCIe

It's the latest BIOS provided by Dell. I also tried kernel 5.11.0-27, which is mentioned in the compatibility list of the CUDA toolkit. The error seems to be the same.

It's even certified by NVIDIA to work with an A100 80GB:
https://www.nvidia.com/en-us/data-center/data-center-gpus/qualified-system-catalog/
Please post/attach the output of
sudo lspci -t

Checking the BIOS manual, it also has an option to set the MM I/O Base. Please check if changing that makes it work.

Yes, I've checked that setting. There are only 3 options: 56TB (default), 12TB, and 512GB.
After choosing the last one, the board doesn't detect the GPU. I get this message in the iDRAC log:

The data communication with the device video.slot.7-1 (GPU Controller in Slot 7 of Instance 1) is lost.

Options 56TB and 12TB work the same way: the board shows the GPU but the NVIDIA driver crashes.

dmesg.log (137.4 KB)
dmesg_MM12TB.log (135.4 KB)
dmesg_MM512GB.log (119.4 KB)
lspci.log (5.4 KB)

All other options for Integrated Devices:

Setting the kernel parameter pci=realloc=off solves the problem.
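For anyone landing here, the same GRUB recipe as above applies, just with realloc disabled, plus a quick sanity check that the BARs now assign cleanly:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc=off"   # in /etc/default/grub, then update-grub and reboot
cat /proc/cmdline | grep realloc
sudo dmesg | grep -i "failed to assign"   # should print nothing anymore
nvidia-smi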

I’ve received the solution on Dell’s forum: