Occasional initialization failures of an NVSHMEM program

Hi,

Occasionally an NVSHMEM program fails to initialize; the more GPUs are used, the worse it gets. The error on stdout is:
src/comm/transports/ibrc/ibrc.cpp:516: non-zero status: 101 ibv_modify_qp failed
src/comm/transports/ibrc/ibrc.cpp:1425: non-zero status: 7 ep_connect failed
src/comm/transports/ibrc/ibrc.cpp:1490: non-zero status: 7 transport create connect failed
src/comm/transport.cpp:111: non-zero status: 7 endpoint connection failed
src/init/init.cpp:689: non-zero status: 7 nvshmem setup connections failed
src/coll/host/barrier.cpp:nvshmem_barrier_all:35: nvshmem initialization failed, exiting

The installed MLNX OFED is 5.0, and I wonder if that could be the cause?

root@n001:~# modinfo nv_peer_mem 
filename:       /lib/modules/5.4.0-122-generic/updates/dkms/nv_peer_mem.ko
version:        1.3-0
license:        Dual BSD/GPL
description:    NVIDIA GPU memory plug-in
author:         Yishai Hadas
srcversion:     EEBE62BF9E5B1A64F1BD327
depends:        ib_core,nvidia
retpoline:      Y
name:           nv_peer_mem
vermagic:       5.4.0-122-generic SMP mod_unload modversions 
signat:         PKCS#7
signer:         
sig_key:        
sig_hashalgo:   md4
parm:           enable_dbg:enable debug tracing (int)
root@n001:~# modinfo gdrdrv 
filename:       /lib/modules/5.4.0-122-generic/updates/dkms/gdrdrv.ko
version:        2.3
description:    GDRCopy kernel-mode driver
license:        MIT
author:         drossetti@nvidia.com
srcversion:     236C99123346132C4D71397
depends:        nv-p2p-dummy
retpoline:      Y
name:           gdrdrv
vermagic:       5.4.0-122-generic SMP mod_unload modversions 
signat:         PKCS#7
signer:         
sig_key:        
sig_hashalgo:   md4
parm:           dbg_enabled:enable debug tracing (int)
parm:           info_enabled:enable info tracing (int)

Please advise!

Rgds,
Tor

Hi Tor,

There are many possible reasons for ibv_modify_qp to fail.
It is hard to determine the cause from the current logs. I suggest retesting with OFED 5.7.
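
One small thing that can be read from the existing log: ibv_modify_qp returns an errno value on failure, so if NVSHMEM prints that return value unchanged (an assumption here), status 101 would be ENETUNREACH on Linux. The sketch below pulls the status codes out of the "non-zero status" lines and looks them up with Python's errno module. The helper names parse_status_lines and describe_errno are made up for this example, and only the ibv_modify_qp line should be read as an errno; the later status values may be NVSHMEM's own codes.

```python
import errno
import os
import re

# Matches lines such as:
#   src/comm/transports/ibrc/ibrc.cpp:516: non-zero status: 101 ibv_modify_qp failed
STATUS_LINE = re.compile(
    r"^(?P<file>\S+?):(?P<line>\d+): non-zero status: (?P<status>\d+) (?P<msg>.+)$"
)


def parse_status_lines(log_text):
    """Return (file, line, status, message) for every 'non-zero status' line."""
    entries = []
    for raw in log_text.splitlines():
        m = STATUS_LINE.match(raw.strip())
        if m:
            entries.append(
                (m["file"], int(m["line"]), int(m["status"]), m["msg"])
            )
    return entries


def describe_errno(code):
    """Map a numeric errno to its symbolic name and message on this platform."""
    name = errno.errorcode.get(code, "unknown")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    log = (
        "src/comm/transports/ibrc/ibrc.cpp:516: non-zero status: 101 ibv_modify_qp failed\n"
        "src/comm/transports/ibrc/ibrc.cpp:1425: non-zero status: 7 ep_connect failed\n"
    )
    for path, line, status, msg in parse_status_lines(log):
        # Assumption: only the ibv_modify_qp status is a raw errno value.
        if "ibv_modify_qp" in msg:
            print(f"{path}:{line}: {msg} -> {describe_errno(status)}")
```

Run it on the node where the failure occurred, since errno numbering is platform specific.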