Hi,
I have run into trouble with inter-node NVSHMEM initialization after updating the NVIDIA drivers to 560-open.
The error is:
dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/comm/transports/common/transport_ib_common.cpp:74: NULL value mem registration failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/comm/transports/ibrc/ibrc.cpp:477: non-zero status: 2 Unable to register memory handle.
[a0905:1995345:0:1995345] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:1995345) ====
0 /apps/SPACK/0.19.1/opt/linux-almalinux8-zen/gcc-8.5.0/ucx-1.13.1-woaymodwh7p66njpgt76d7fyqyv7srl3/lib/libucs.so.0(ucs_handle_error+0x294) [0x151bf9af4b34]
1 /apps/SPACK/0.19.1/opt/linux-almalinux8-zen/gcc-8.5.0/ucx-1.13.1-woaymodwh7p66njpgt76d7fyqyv7srl3/lib/libucs.so.0(+0x2cd04) [0x151bf9af4d04]
2 /apps/SPACK/0.19.1/opt/linux-almalinux8-zen/gcc-8.5.0/ucx-1.13.1-woaymodwh7p66njpgt76d7fyqyv7srl3/lib/libucs.so.0(+0x2cfb8) [0x151bf9af4fb8]
3 /usr/lib64/libpthread.so.0(+0x12d20) [0x151bfd9aed20]
4 /apps/SPACK/0.19.1/opt/linux-almalinux8-zen/gcc-8.5.0/nvhpc-23.7-bzxcokzjvx4stynglo4u2ffpljajzlam/Linux_x86_64/23.7/comm_libs/12.2/nvshmem/lib/nvshmem_transport_ibrc.so.1(+0x92c5) [0x151bec3132c5]
5 /apps/SPACK/0.19.1/opt/linux-almalinux8-zen/gcc-8.5.0/nvhpc-23.7-bzxcokzjvx4stynglo4u2ffpljajzlam/Linux_x86_64/23.7/comm_libs/12.2/nvshmem/lib/nvshmem_transport_ibrc.so.1(+0x25c1) [0x151bec30c5c1]
Before the update, the same code ran successfully, with only a warning at run time.
The warning was:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/comm/transports/ibrc/ibrc.cpp:nvshmemt_init:1747: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.
It seems I was able to run the code without nv_peer_mem or nvidia_peermem before. However, after the driver update, I’m facing issues.
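To check which of the two modules is actually loaded on a node, something like the following could be run on each host. This is only a sketch: it assumes a Linux system that exposes /proc/modules, the helper name loaded_peer_mem_modules is made up for illustration, and it does not reproduce NVSHMEM's own detection logic.

```python
from pathlib import Path

# Kernel module names mentioned in the NVSHMEM warning above.
PEER_MEM_MODULES = ("nv_peer_mem", "nvidia_peermem")


def loaded_peer_mem_modules(proc_modules_text):
    """Return the peer-memory modules named in /proc/modules-style text.

    Hypothetical helper: each non-empty line of /proc/modules starts
    with the module name, so only the first field is inspected.
    """
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in PEER_MEM_MODULES if name in loaded]


if __name__ == "__main__":
    path = Path("/proc/modules")  # assumption: Linux host
    text = path.read_text() if path.exists() else ""
    found = loaded_peer_mem_modules(text)
    print("peer-memory modules loaded:", ", ".join(found) if found else "none")
```

Comparing the result from before and after the driver update on the same node would show whether the update changed which module is present.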
Could this be related to the driver update? If so, is there any way to resolve this?
Thanks!