NVSHMEM runtime initialization

Hi,

I have run into trouble initializing NVSHMEM across nodes after updating the NVIDIA driver to 560-open.

The error is:

dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/comm/transports/common/transport_ib_common.cpp:74: NULL value mem registration failed 
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/comm/transports/ibrc/ibrc.cpp:477: non-zero status: 2 Unable to register memory handle.
[a0905:1995345:0:1995345] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:1995345) ====
 0  /apps/SPACK/0.19.1/opt/linux-almalinux8-zen/gcc-8.5.0/ucx-1.13.1-woaymodwh7p66njpgt76d7fyqyv7srl3/lib/libucs.so.0(ucs_handle_error+0x294) [0x151bf9af4b34]
 1  /apps/SPACK/0.19.1/opt/linux-almalinux8-zen/gcc-8.5.0/ucx-1.13.1-woaymodwh7p66njpgt76d7fyqyv7srl3/lib/libucs.so.0(+0x2cd04) [0x151bf9af4d04]
 2  /apps/SPACK/0.19.1/opt/linux-almalinux8-zen/gcc-8.5.0/ucx-1.13.1-woaymodwh7p66njpgt76d7fyqyv7srl3/lib/libucs.so.0(+0x2cfb8) [0x151bf9af4fb8]
 3  /usr/lib64/libpthread.so.0(+0x12d20) [0x151bfd9aed20]
 4  /apps/SPACK/0.19.1/opt/linux-almalinux8-zen/gcc-8.5.0/nvhpc-23.7-bzxcokzjvx4stynglo4u2ffpljajzlam/Linux_x86_64/23.7/comm_libs/12.2/nvshmem/lib/nvshmem_transport_ibrc.so.1(+0x92c5) [0x151bec3132c5]
 5  /apps/SPACK/0.19.1/opt/linux-almalinux8-zen/gcc-8.5.0/nvhpc-23.7-bzxcokzjvx4stynglo4u2ffpljajzlam/Linux_x86_64/23.7/comm_libs/12.2/nvshmem/lib/nvshmem_transport_ibrc.so.1(+0x25c1) [0x151bec30c5c1]

Before the update, the same code ran successfully, although it printed a warning at run time.
The warning was:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/comm/transports/ibrc/ibrc.cpp:nvshmemt_init:1747: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

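So before the update I was able to run the code without nv_peer_mem or nvidia_peermem loaded, because the IBRC transport was simply skipped. After the driver update, the transport is no longer skipped and memory registration fails instead.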

Could this be related to the driver update? If so, is there any way to resolve this?

Thanks!

Have you tried installing nvidia_peermem?
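
Before installing anything, it may help to confirm whether either module is currently loaded on each node. Below is a minimal sketch that reads /proc/modules and looks for the two module names from the warning. The helper name loaded_peer_mem_modules is made up for illustration and is not part of NVSHMEM or the driver.

```python
from pathlib import Path

# Module names taken from the NVSHMEM warning quoted in the question.
PEER_MEM_MODULES = ("nvidia_peermem", "nv_peer_mem")


def loaded_peer_mem_modules(proc_modules_text):
    """Return the peer-memory modules found in /proc/modules-style text.

    Hypothetical helper for illustration: the first whitespace-separated
    field of each line in /proc/modules is the module name.
    """
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in PEER_MEM_MODULES if name in loaded]


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    found = loaded_peer_mem_modules(text)
    if found:
        print("loaded:", ", ".join(found))
    else:
        print("neither nvidia_peermem nor nv_peer_mem is loaded")
```

If neither module shows up, loading the one shipped with the driver (typically `sudo modprobe nvidia-peermem`, assuming your driver package includes it) would be the next thing to try.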