Hello,
On my HGX GPU cluster, the following error occurs when training jobs start to run.
It causes problems with the reliability of our AI models.
Has anyone seen this error before? Do you have any ideas about what causes it?
Thanks for your help.
Best
[25033.266922] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.280589] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.300821] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.320988] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.342081] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.360507] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.380740] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.400553] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.420777] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.440911] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.461063] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.481198] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.501350] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
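To quantify how often the message appears, here is a minimal Python sketch that counts the nvidia-peermem ERROR lines and reports the time span they cover. It assumes the lines have exactly the format shown above (timestamp in brackets, module name, function:line, then ERROR). The function name `summarize_peermem_errors` and the embedded sample are illustrative only and not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt above:
# [25033.266922] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, ...
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+nvidia-peermem\s+"
    r"(?P<func>\w+):(?P<line>\d+)\s+ERROR\s+(?P<msg>.*)$"
)


def summarize_peermem_errors(lines):
    """Return (count, first_ts, last_ts, per-function Counter) for matching lines."""
    timestamps = []
    by_func = Counter()
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # not an nvidia-peermem ERROR line
        timestamps.append(float(match.group("ts")))
        by_func[match.group("func")] += 1
    if not timestamps:
        return 0, None, None, by_func
    return len(timestamps), min(timestamps), max(timestamps), by_func


# Two lines copied from the log above, plus one unrelated (hypothetical) line.
SAMPLE = [
    "[25033.266922] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing",
    "[25033.280589] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing",
    "[25040.000000] hypothetical unrelated kernel message",
]

count, first, last, by_func = summarize_peermem_errors(SAMPLE)
print(f"{count} errors between t={first} and t={last}")
for func, n in by_func.items():
    print(f"  {func}: {n}")
```

To run it on a real log, the saved kernel log can be opened as a file and passed to `summarize_peermem_errors`, since the function accepts any iterable of lines.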
modinfo nvidia_peermem
filename: /lib/modules/5.15.0-1060-nvidia/updates/dkms/nvidia-peermem.ko
version: 535.183.01
license: Linux-OpenIB
description: NVIDIA GPU memory plug-in
author: Yishai Hadas
srcversion: 14719BA3A7992087B56C615
depends: nvidia,ib_core
retpoline: Y
name: nvidia_peermem
vermagic: 5.15.0-1060-nvidia SMP mod_unload modversions
sig_id: PKCS#7
signer: cloudvm Secure Boot Module Signature key
sig_key: 51:5C:6A:D4:65:84:FC:3D:1F:16:36:3B:C0:BE:11:38:40:66:2B:40
sig_hashalgo: sha512
signature: AB:28:07:0D:59:AC:6C:F5:34:44:57:F7:AC:83:62:48:99:B7:90:71:
A4:E8:1E:D7:21:94:6D:80:69:6B:19:00:D3:2A:D5:A8:CA:B4:CE:3C:
54:FC:CD:05:85:DA:39:58:E7:47:45:3E:88:F6:9B:8C:FF:62:75:7E:
0E:70:D1:0D:7B:28:AF:9B:FC:CE:8E:5A:69:63:55:5E:F2:69:A6:8B:
8C:7D:7D:E5:EA:0C:1F:3C:A7:F6:6E:DA:8B:D0:EA:4D:65:CE:24:2B:
31:5B:00:32:A6:D2:8C:C9:AB:40:42:EF:82:B3:A7:9F:93:7C:2E:9A:
3F:C9:43:B6:99:B8:F6:11:62:C5:70:C9:BC:7B:5B:E7:9E:38:9F:91:
8C:A4:91:B9:7A:31:2B:4D:76:3B:94:47:3D:9A:08:12:54:A8:7A:B8:
AE:A7:E2:E6:29:DA:3F:30:5F:CA:79:6C:E1:95:82:EF:E4:B9:48:ED:
AF:6E:62:94:F3:F6:93:73:C2:11:72:E0:A2:9A:CF:38:81:28:57:F6:
B2:CB:E0:FF:A2:20:4B:7B:16:30:0F:7F:4B:51:91:D2:2E:53:D7:71:
66:C0:4A:0B:D6:E6:BC:DC:04:51:6D:6E:92:DA:4B:F1:EA:D2:37:9A:
AE:3C:F1:83:AC:7B:C1:35:2D:EF:E7:2A:92:E7:AB:D7
parm: peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)
nvidia-smi
Fri Aug 30 16:55:07 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          Off | 00000000:0A:00.0 Off |                    0 |
| N/A   30C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off | 00000000:18:00.0 Off |                    0 |
| N/A   26C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off | 00000000:23:00.0 Off |                    0 |
| N/A   25C    P0              70W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off | 00000000:2C:00.0 Off |                    0 |
| N/A   30C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off | 00000000:87:00.0 Off |                    0 |
| N/A   30C    P0              70W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off | 00000000:90:00.0 Off |                    0 |
| N/A   25C    P0              70W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off | 00000000:B8:00.0 Off |                    0 |
| N/A   25C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off | 00000000:C1:00.0 Off |                    0 |
| N/A   30C    P0              72W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Operating System: Ubuntu 22.04.4 LTS
Kernel: Linux 5.15.0-1060-nvidia
ofed_info -s
MLNX_OFED_LINUX-23.10-1.1.9.0