Hi,
Occasionally an NVSHMEM program fails to initialize, and the failures become more frequent as the number of GPUs increases. The error printed to stdout is:
src/comm/transports/ibrc/ibrc.cpp:516: non-zero status: 101 ibv_modify_qp failed
src/comm/transports/ibrc/ibrc.cpp:1425: non-zero status: 7 ep_connect failed
src/comm/transports/ibrc/ibrc.cpp:1490: non-zero status: 7 transport create connect failed
src/comm/transport.cpp:111: non-zero status: 7 endpoint connection failed
src/init/init.cpp:689: non-zero status: 7 nvshmem setup connections failed
src/coll/host/barrier.cpp:nvshmem_barrier_all:35: nvshmem initialization failed, exiting
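In case it helps with triage: ibv_modify_qp is documented to return the value of errno on failure. Assuming NVSHMEM prints that return value unchanged as the "status" (an assumption on my part, not something I have verified in the source), the 101 can be decoded with a short Python sketch. The helper name describe_status is only for illustration. On Linux, errno 101 is ENETUNREACH ("Network is unreachable"); other platforms number their errno values differently.

```python
import errno
import os


def describe_status(code):
    """Map a numeric errno value to (symbolic name, message) on this platform.

    Hypothetical helper: it assumes the status in the log is a plain errno
    value, which is not confirmed for every line of the NVSHMEM output.
    """
    name = errno.errorcode.get(code, "UNKNOWN")  # symbolic name, e.g. ENETUNREACH
    message = os.strerror(code)                  # platform-specific description
    return name, message


if __name__ == "__main__":
    # 101 is the status reported next to "ibv_modify_qp failed" above.
    print(describe_status(101))
```

The status 7 on the later lines looks like an internal NVSHMEM error code rather than an errno value, so I would not decode it the same way.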
The installed MLNX OFED version is 5.0, and I wonder whether that could be the cause. The modinfo output for nv_peer_mem and gdrdrv follows:
root@n001:~# modinfo nv_peer_mem
filename: /lib/modules/5.4.0-122-generic/updates/dkms/nv_peer_mem.ko
version: 1.3-0
license: Dual BSD/GPL
description: NVIDIA GPU memory plug-in
author: Yishai Hadas
srcversion: EEBE62BF9E5B1A64F1BD327
depends: ib_core,nvidia
retpoline: Y
name: nv_peer_mem
vermagic: 5.4.0-122-generic SMP mod_unload modversions
signat: PKCS#7
signer:
sig_key:
sig_hashalgo: md4
parm: enable_dbg:enable debug tracing (int)
root@n001:~# modinfo gdrdrv
filename: /lib/modules/5.4.0-122-generic/updates/dkms/gdrdrv.ko
version: 2.3
description: GDRCopy kernel-mode driver
license: MIT
author: drossetti@nvidia.com
srcversion: 236C99123346132C4D71397
depends: nv-p2p-dummy
retpoline: Y
name: gdrdrv
vermagic: 5.4.0-122-generic SMP mod_unload modversions
signat: PKCS#7
signer:
sig_key:
sig_hashalgo: md4
parm: dbg_enabled:enable debug tracing (int)
parm: info_enabled:enable info tracing (int)
Please advise!
Regards,
Tor