System crashes after "nvidia-smi" command. RHEL 7.6 / A100 GPU

Hello,

I have 2 new A100 GPUs in my server, which is running RHEL 7.6.
After some uptime (about 1 hour), every time I try to run nvidia-smi my system reboots.

When I'm lucky, right after a reboot, I can get the output below. I don't know why GPU 1 shows 15% utilization, and sometimes 100%, for no reason, with no running processes listed.


nvidia-smi
Wed Jun  9 15:57:55 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:02:00.0 Off |                    0 |
| N/A   51C    P0    42W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-PCIE-40GB      Off  | 00000000:82:00.0 Off |                    0 |
| N/A   57C    P0    51W / 250W |      0MiB / 40536MiB |     15%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here are the NVIDIA bug report log file and the crash logs:
nvidia-bug-report.log (2.7 MB)

Crash log.txt (48.6 KB)

Thanks for your help. I will be here to provide more information if needed.

Cordially,