Hello,
I am currently working on a cluster with a node like so:
Running a binding script like this:
#!/bin/bash
EXE=$1
ARGS=$2
APP="$EXE $ARGS"
# This is the list of GPUs we have
GPUS=(0 1 2 3)
# This is the list of NICs we should use for each GPU
# e.g., associate GPUs 0,1 with MLX0, GPUs 2,3 with MLX1
NICS=(mlx5_0:1 mlx5_0:1 mlx5_1:1 mlx5_1:1)
# This is the list of CPU cores we should use for each GPU
# On the Ampere nodes we have 2x64 core CPUs, each organised into 4 NUMA domains
# We will use only a subset of the available NUMA domains, i.e. 1 NUMA domain per GPU
# The NUMA domain closest to each GPU can be extracted from nvidia-smi
CPUS=(48-63 16-31 112-127 80-95)
# This is the list of memory domains we should use for each GPU
MEMS=(3 1 7 5)
# Number of physical CPU cores per GPU (optional)
export OMP_NUM_THREADS=16
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
export CUDA_VISIBLE_DEVICES=${GPUS[${lrank}]}
export UCX_NET_DEVICES=${NICS[${lrank}]}
numactl --physcpubind=${CPUS[${lrank}]} --membind=${MEMS[${lrank}]} $APP
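For reference, I launch the job roughly like this (the wrapper and binary names below are just placeholders for this post), and the CPU/memory/NIC affinities in the script were read off nvidia-smi topo -m:
# one rank per GPU on a single node; let the wrapper script above do the binding
mpirun -np 4 --bind-to none ./bind.sh ./my_nvshmem_app "arg1 arg2"
# affinity table used to fill in the CPUS/MEMS/NICS arrays
nvidia-smi topo -m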
I am trying to use MPI ranks to give me NVSHMEM PEs, as in the documentation (Using NVSHMEM — NVSHMEM 3.0.6 documentation).
The goal is to have each GPU, together with its closest CPU cores, in one MPI rank and one NVSHMEM PE. How do I verify that this is working correctly? I have already used nsys to see that each MPI rank is running on a separate GPU, but how do I know that the ranks are using the correct NICs? Also, have the symmetric heaps already been allocated (judging from the line gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] allocated 16777216 bytes, ptr: 0x28260000000)?
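The only check I have come up with for the NIC question so far is to snapshot the InfiniBand port counters around a run, something like the sketch below (paths assume the usual Mellanox sysfs layout, so treat it as an idea rather than a tested recipe):
# before (and again after) the run: transmitted data per HCA port
for dev in mlx5_0 mlx5_1 mlx5_2 mlx5_3; do
  echo -n "$dev: "; cat /sys/class/infiniband/$dev/ports/1/counters/port_xmit_data
done
# the two HCAs named in the NICS array above should show the large increase
# (I believe the counter is reported in 4-byte words)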
Can someone run me through what the NVSHMEM start-up debug output means?
NVSHMEM configuration:
CUDA API 11040
CUDA Runtime 11040
CUDA Driver 12040
Build Timestamp Sep 10 2024 11:39:26
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO PE distribution has been identified as NVSHMEMI_PE_DIST_BLOCK
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO PE 3 (process) affinity to 16 CPUs:
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
Build Variables
NVSHMEM_DEBUG=OFF NVSHMEM_DEVEL=OFF NVSHMEM_DEFAULT_PMI2=OFF
NVSHMEM_DEFAULT_PMIX=OFF NVSHMEM_DEFAULT_UCX=OFF NVSHMEM_DISABLE_COLL_POLL=ON
NVSHMEM_ENABLE_ALL_DEVICE_INLINING=OFF NVSHMEM_GPU_COLL_USE_LDST=OFF
NVSHMEM_IBGDA_SUPPORT=OFF NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=OFF
NVSHMEM_IBDEVX_SUPPORT=OFF NVSHMEM_IBRC_SUPPORT=ON
NVSHMEM_MPI_SUPPORT=1 NVSHMEM_NVTX=ON NVSHMEM_PMIX_SUPPORT=OFF
NVSHMEM_SHMEM_SUPPORT=OFF NVSHMEM_TEST_STATIC_LIB=OFF
NVSHMEM_TIMEOUT_DEVICE_POLLING=OFF NVSHMEM_TRACE=OFF
NCCL_HOME=/usr/local/nccl
NVSHMEM_PREFIX=/home/co-morg1/rds/hpc-work/nvshmem_sep_10
UCX_HOME=/usr/local/software/spack/spack-rhel8-20210927/opt/spack/linux-centos8-zen2/gcc-9.4.0/ucx-1.11.1-lktqyl4gjbz36wqifl2e2wonn65xtrsr
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO PE 0 (process) affinity to 16 CPUs:
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO PE distribution has been identified as NVSHMEMI_PE_DIST_BLOCK
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO PE 1 (process) affinity to 16 CPUs:
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO PE distribution has been identified as NVSHMEMI_PE_DIST_BLOCK
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO PE 2 (process) affinity to 16 CPUs:
112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO cudaDriverVersion 12040
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO cudaDriverVersion 12040
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO cudaDriverVersion 12040
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x36b83d0
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x1d418e0
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO cudaDriverVersion 12040
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x31e9980
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO NVSHMEM symmetric heap kind = DEVICE selected
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] nvshmemi_get_cucontext->cuCtxSynchronize->CUDA_SUCCESS) my_stream (nil)
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO in get_cucontext, queried and saved context for device: 0 context: 0x206ff70
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x479b230
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x4ab0730
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x60d9480
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] nvshmemi_get_cucontext->cuCtxGetDevice->0(CUDA_ERROR_INVALID_CONTEXT 201) cuStreamCreateWithPriority my_stream 0x5c33e30
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO nvshmemi_setup_local_heap, heapextra = 285225000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO NVML library found. libnvidia-ml.so
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO NVML library found. libnvidia-ml.so
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO NVML library found. libnvidia-ml.so
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO NVML library found. libnvidia-ml.so
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/common/transport_gdr_common.cpp 73 GDR driver version: (2, 4)
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1635 Begin - Enumerating IB devices in the system ([<dev_id, device_name, num_ports>]) -
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/common/transport_gdr_common.cpp 73 GDR driver version: (2, 4)
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1635 Begin - Enumerating IB devices in the system ([<dev_id, device_name, num_ports>]) -
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/common/transport_gdr_common.cpp 73 GDR driver version: (2, 4)
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1635 Begin - Enumerating IB devices in the system ([<dev_id, device_name, num_ports>]) -
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/common/transport_gdr_common.cpp 73 GDR driver version: (2, 4)
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1635 Begin - Enumerating IB devices in the system ([<dev_id, device_name, num_ports>]) -
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=0 (of 4), name=mlx5_0, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=1 (of 4), name=mlx5_1, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=0 (of 4), name=mlx5_0, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=0 (of 4), name=mlx5_0, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=0 (of 4), name=mlx5_0, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=2 (of 4), name=mlx5_2, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=3 (of 4), name=mlx5_3, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1737 End - Enumerating IB devices in the system
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1742 Begin - Ordered list of devices for assignment (after processing user provdied env vars (if any)) -
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1746 Ordered list of devices for assignment - idx=0 (of 2), device id=0, port_num=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1746 Ordered list of devices for assignment - idx=1 (of 2), device id=1, port_num=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1750 End - Ordered list of devices for assignment (after processing user provdied env vars (if any))
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 212 /home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp:1790 Ib Alloc Size 2097152 pointer 0x5cb8000
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=1 (of 4), name=mlx5_1, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=1 (of 4), name=mlx5_1, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=1 (of 4), name=mlx5_1, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=2 (of 4), name=mlx5_2, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=2 (of 4), name=mlx5_2, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=2 (of 4), name=mlx5_2, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=3 (of 4), name=mlx5_3, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=3 (of 4), name=mlx5_3, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1656 Enumerated IB devices in the system - device id=3 (of 4), name=mlx5_3, num_ports=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1737 End - Enumerating IB devices in the system
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1742 Begin - Ordered list of devices for assignment (after processing user provdied env vars (if any)) -
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1746 Ordered list of devices for assignment - idx=0 (of 2), device id=0, port_num=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1746 Ordered list of devices for assignment - idx=1 (of 2), device id=1, port_num=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1750 End - Ordered list of devices for assignment (after processing user provdied env vars (if any))
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1737 End - Enumerating IB devices in the system
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1742 Begin - Ordered list of devices for assignment (after processing user provdied env vars (if any)) -
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1746 Ordered list of devices for assignment - idx=0 (of 2), device id=0, port_num=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1746 Ordered list of devices for assignment - idx=1 (of 2), device id=1, port_num=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1750 End - Ordered list of devices for assignment (after processing user provdied env vars (if any))
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1737 End - Enumerating IB devices in the system
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1742 Begin - Ordered list of devices for assignment (after processing user provdied env vars (if any)) -
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1746 Ordered list of devices for assignment - idx=0 (of 2), device id=0, port_num=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1746 Ordered list of devices for assignment - idx=1 (of 2), device id=1, port_num=1
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 1750 End - Ordered list of devices for assignment (after processing user provdied env vars (if any))
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 212 /home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp:1790 Ib Alloc Size 2097152 pointer 0x4819000
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 212 /home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp:1790 Ib Alloc Size 2097152 pointer 0x6157000
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 212 /home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp:1790 Ib Alloc Size 2097152 pointer 0x4b2f000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 1
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 1
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 0
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO NVSHMEM_ENABLE_NIC_PE_MAPPING = 0, device 0 setting dev_id = 0
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] status 0 cudaErrorInvalidValue 1 cudaErrorInvalidSymbol 13 cudaErrorInvalidMemcpyDirection 21 cudaErrorNoKernelImageForDevice 209
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] status 0 cudaErrorInvalidValue 1 cudaErrorInvalidSymbol 13 cudaErrorInvalidMemcpyDirection 21 cudaErrorNoKernelImageForDevice 209
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] status 0 cudaErrorInvalidValue 1 cudaErrorInvalidSymbol 13 cudaErrorInvalidMemcpyDirection 21 cudaErrorNoKernelImageForDevice 209
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] status 0 cudaErrorInvalidValue 1 cudaErrorInvalidSymbol 13 cudaErrorInvalidMemcpyDirection 21 cudaErrorNoKernelImageForDevice 209
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO calling get_mem_handle for transport: 0 buf: 0x28260000000 size: 536870912
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO calling get_mem_handle for transport: 0 buf: 0x14a400000000 size: 536870912
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO calling get_mem_handle for transport: 0 buf: 0x2b0e0000000 size: 536870912
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO calling get_mem_handle for transport: 0 buf: 0x14d5a0000000 size: 536870912
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] get_mem_handle transport 0 handles 0x7ffd2713d950
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO calling get_mem_handle for transport: 1 buf: 0x28260000000 size: 536870912
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] get_mem_handle transport 0 handles 0x7fffbc063200
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO calling get_mem_handle for transport: 1 buf: 0x2b0e0000000 size: 536870912
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] get_mem_handle transport 0 handles 0x7fff9a32ff80
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO calling get_mem_handle for transport: 1 buf: 0x14a400000000 size: 536870912
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] get_mem_handle transport 0 handles 0x7fff9968c4a0
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO calling get_mem_handle for transport: 1 buf: 0x14d5a0000000 size: 536870912
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/common/transport_ib_common.cpp 96 ibv_reg_mr handle 0x7ffd2713db50 handle->mr (nil)
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/common/transport_ib_common.cpp 96 ibv_reg_mr handle 0x7fffbc063400 handle->mr (nil)
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 212 /home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp:559 Ib Alloc Size 8 pointer 0x5492000
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] get_mem_handle transport 1 handles 0x7ffd2713db50
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/common/transport_ib_common.cpp 96 ibv_reg_mr handle 0x7fff9a330180 handle->mr (nil)
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 212 /home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp:559 Ib Alloc Size 8 pointer 0x5c3c000
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/common/transport_ib_common.cpp 96 ibv_reg_mr handle 0x7fff9968c6a0 handle->mr (nil)
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] get_mem_handle transport 1 handles 0x7fffbc063400
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 212 /home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp:559 Ib Alloc Size 8 pointer 0x6dd0000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] get_mem_handle transport 1 handles 0x7fff9a330180
/home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp 212 /home/co-morg1/rds/hpc-work/nvshmem_src_2.11.0-5/src/modules/transport/ibrc/ibrc.cpp:559 Ib Alloc Size 8 pointer 0x4d4c000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] get_mem_handle transport 1 handles 0x7fff9968c6a0
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] cuIpcOpenMemHandle fromhandle 0x70000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] cuIpcOpenMemHandle tobuf 0x2a260000000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] cuIpcOpenMemHandle fromhandle 0x72000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] cuIpcOpenMemHandle tobuf 0x14f5a0000000
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] cuIpcOpenMemHandle fromhandle 0x72000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] cuIpcOpenMemHandle tobuf 0x2c260000000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] cuIpcOpenMemHandle fromhandle 0x71000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] cuIpcOpenMemHandle tobuf 0x14c400000000
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] cuIpcOpenMemHandle fromhandle 0x71000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] cuIpcOpenMemHandle tobuf 0x2e260000000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] cuIpcOpenMemHandle fromhandle 0x70000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] cuIpcOpenMemHandle tobuf 0x1515a0000000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] cuIpcOpenMemHandle fromhandle 0x71000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] cuIpcOpenMemHandle tobuf 0x1535a0000000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] cuIpcOpenMemHandle fromhandle 0x71000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] cuIpcOpenMemHandle tobuf 0x2d0e0000000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] cuIpcOpenMemHandle fromhandle 0x70000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] cuIpcOpenMemHandle tobuf 0x14e400000000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] cuIpcOpenMemHandle fromhandle 0x72000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] cuIpcOpenMemHandle tobuf 0x2f0e0000000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] cuIpcOpenMemHandle fromhandle 0x72000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] cuIpcOpenMemHandle tobuf 0x150400000000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] cuIpcOpenMemHandle fromhandle 0x70000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] cuIpcOpenMemHandle tobuf 0x310e0000000
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] allocated 16777216 bytes, ptr: 0x28260000000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] allocated 16777216 bytes, ptr: 0x14a400000000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] allocated 16777216 bytes, ptr: 0x14d5a0000000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] allocated 16777216 bytes, ptr: 0x2b0e0000000
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] allocated 268435456 bytes, ptr: 0x28261000000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] allocated 268435456 bytes, ptr: 0x14a401000000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] allocated 268435456 bytes, ptr: 0x2b0e1000000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] allocated 268435456 bytes, ptr: 0x14d5a1000000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=3, stride=1, size=1
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=2, stride=1, size=1
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=1, stride=1, size=1
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=0, stride=1, size=1
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO NVSHMEMX_TEAM_NODE: start=0, stride=1, size=4
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=0, stride=4, size=1
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO NVSHMEMX_TEAM_NODE: start=0, stride=1, size=4
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=2, stride=4, size=1
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO NVSHMEMI_TEAM_SAME_GPU: start=2, stride=1, size=1
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO NVSHMEMX_TEAM_NODE: start=0, stride=1, size=4
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO NVSHMEMI_TEAM_SAME_GPU: start=0, stride=1, size=1
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO NVSHMEMX_TEAM_NODE: start=0, stride=1, size=4
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=3, stride=4, size=1
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO NVSHMEMI_TEAM_SAME_GPU: start=3, stride=1, size=1
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO NVSHMEM_TEAM_SHARED: start=1, stride=4, size=1
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO NVSHMEMI_TEAM_SAME_GPU: start=1, stride=1, size=1
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=4
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=4
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=4
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO NVSHMEMI_TEAM_GPU_LEADERS: start=0, stride=1, size=4
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] allocated 128450560 bytes, ptr: 0x2b0f1000000
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] allocated 128450560 bytes, ptr: 0x14d5b1000000
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] allocated 128450560 bytes, ptr: 0x28271000000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] allocated 128450560 bytes, ptr: 0x14a411000000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] allocated 512 bytes, ptr: 0x2b0f8a80000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] allocated 32 bytes, ptr: 0x2b0f8a80200
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] allocated 512 bytes, ptr: 0x28278a80000
gpu-q-74:4164303:4164303 [0] NVSHMEM INFO [2] allocated 8 bytes, ptr: 0x2b0f8a80400
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] allocated 512 bytes, ptr: 0x14a418a80000
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] allocated 32 bytes, ptr: 0x28278a80200
gpu-q-74:4164305:4164305 [0] NVSHMEM INFO [0] allocated 8 bytes, ptr: 0x28278a80400
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] allocated 32 bytes, ptr: 0x14a418a80200
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] allocated 512 bytes, ptr: 0x14d5b8a80000
gpu-q-74:4164298:4164298 [0] NVSHMEM INFO [1] allocated 8 bytes, ptr: 0x14a418a80400
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] allocated 32 bytes, ptr: 0x14d5b8a80200
gpu-q-74:4164299:4164299 [0] NVSHMEM INFO [3] allocated 8 bytes, ptr: 0x14d5b8a80400
For additional context, the reason I am doubting the NVSHMEM setup is that I have an error in my code where all NVSHMEM PEs try to do nvshmem_double_put_nbi, but it fails with segmentation faults once PEs 1 and 3 start this process (over multiple runs it is always PEs 1 and 3).
I know this is a long post, so the TL;DR here is: “How do I use NVSHMEM_DEBUG?”
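For completeness, the log above came from exporting NVSHMEM_DEBUG=INFO in the job script; the extra knobs below are ones I found in the environment-variable list in the docs but have not fully understood, so this is just how I think they are meant to be used (corrections welcome):
export NVSHMEM_DEBUG=INFO                    # WARN / INFO / TRACE
export NVSHMEM_DEBUG_SUBSYS=INIT,TRANSPORT   # limit output to chosen subsystems, if I read the docs right
export NVSHMEM_DEBUG_FILE=nvshmem.%h.%p.log  # one log file per host/PID instead of mixing ranks on stdout
export NVSHMEM_INFO=1                        # also dump the runtime configuration at init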