Nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing

Hello,

On my HGX GPU cluster, the following error appears in the kernel log as soon as training jobs start running.
This is causing reliability problems for our AI model training.

Has anyone seen this error before? Do you have any ideas about what causes it?

Thanks for your help.

Best

[25033.266922] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.280589] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.300821] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.320988] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.342081] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.360507] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.380740] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.400553] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.420777] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.440911] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.461063] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.481198] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.501350] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing

modinfo nvidia_peermem
filename:       /lib/modules/5.15.0-1060-nvidia/updates/dkms/nvidia-peermem.ko
version:        535.183.01
license:        Linux-OpenIB
description:    NVIDIA GPU memory plug-in
author:         Yishai Hadas
srcversion:     14719BA3A7992087B56C615
depends:        nvidia,ib_core
retpoline:      Y
name:           nvidia_peermem
vermagic:       5.15.0-1060-nvidia SMP mod_unload modversions 
sig_id:         PKCS#7
signer:         cloudvm Secure Boot Module Signature key
sig_key:        51:5C:6A:D4:65:84:FC:3D:1F:16:36:3B:C0:BE:11:38:40:66:2B:40
sig_hashalgo:   sha512
signature:      AB:28:07:0D:59:AC:6C:F5:34:44:57:F7:AC:83:62:48:99:B7:90:71:
                A4:E8:1E:D7:21:94:6D:80:69:6B:19:00:D3:2A:D5:A8:CA:B4:CE:3C:
                54:FC:CD:05:85:DA:39:58:E7:47:45:3E:88:F6:9B:8C:FF:62:75:7E:
                0E:70:D1:0D:7B:28:AF:9B:FC:CE:8E:5A:69:63:55:5E:F2:69:A6:8B:
                8C:7D:7D:E5:EA:0C:1F:3C:A7:F6:6E:DA:8B:D0:EA:4D:65:CE:24:2B:
                31:5B:00:32:A6:D2:8C:C9:AB:40:42:EF:82:B3:A7:9F:93:7C:2E:9A:
                3F:C9:43:B6:99:B8:F6:11:62:C5:70:C9:BC:7B:5B:E7:9E:38:9F:91:
                8C:A4:91:B9:7A:31:2B:4D:76:3B:94:47:3D:9A:08:12:54:A8:7A:B8:
                AE:A7:E2:E6:29:DA:3F:30:5F:CA:79:6C:E1:95:82:EF:E4:B9:48:ED:
                AF:6E:62:94:F3:F6:93:73:C2:11:72:E0:A2:9A:CF:38:81:28:57:F6:
                B2:CB:E0:FF:A2:20:4B:7B:16:30:0F:7F:4B:51:91:D2:2E:53:D7:71:
                66:C0:4A:0B:D6:E6:BC:DC:04:51:6D:6E:92:DA:4B:F1:EA:D2:37:9A:
                AE:3C:F1:83:AC:7B:C1:35:2D:EF:E7:2A:92:E7:AB:D7
parm:           peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)

nvidia-smi
Fri Aug 30 16:55:07 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          Off | 00000000:0A:00.0 Off |                    0 |
| N/A   30C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off | 00000000:18:00.0 Off |                    0 |
| N/A   26C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off | 00000000:23:00.0 Off |                    0 |
| N/A   25C    P0              70W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off | 00000000:2C:00.0 Off |                    0 |
| N/A   30C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off | 00000000:87:00.0 Off |                    0 |
| N/A   30C    P0              70W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off | 00000000:90:00.0 Off |                    0 |
| N/A   25C    P0              70W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off | 00000000:B8:00.0 Off |                    0 |
| N/A   25C    P0              71W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off | 00000000:C1:00.0 Off |                    0 |
| N/A   30C    P0              72W / 700W |      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Operating System: Ubuntu 22.04.4 LTS
Kernel: Linux 5.15.0-1060-nvidia

ofed_info -s
MLNX_OFED_LINUX-23.10-1.1.9.0