It seems evident that NVSHMEM thinks there is only one PE. The rest of the output/behavior is consistent with that.
What is the configuration of your 4 GPUs? What system are they in? Can they all be in a peer relationship with each other? (deviceQuery can tell you this.) How exactly are you compiling and running the program?
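As an aside, the pairwise peer-access matrix that deviceQuery prints can also be queried programmatically. A minimal sketch using the standard CUDA runtime call cudaDeviceCanAccessPeer:

#include <stdio.h>
#include <cuda_runtime.h>
int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    // Check every ordered pair of GPUs for peer (P2P) access capability.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU%d -> GPU%d : %s\n", i, j, canAccess ? "Yes" : "No");
        }
    return 0;
}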
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 4 CUDA Capable device(s)
Device 0: "Tesla V100-SXM2-16GB"
CUDA Driver Version / Runtime Version 11.4 / 10.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16160 MBytes (16945512448 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1530 MHz (1.53 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 21 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Tesla V100-SXM2-16GB"
CUDA Driver Version / Runtime Version 11.4 / 10.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16160 MBytes (16945512448 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1530 MHz (1.53 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 22 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 2: "Tesla V100-SXM2-16GB"
CUDA Driver Version / Runtime Version 11.4 / 10.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16160 MBytes (16945512448 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1530 MHz (1.53 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 58 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 3: "Tesla V100-SXM2-16GB"
CUDA Driver Version / Runtime Version 11.4 / 10.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16160 MBytes (16945512448 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1530 MHz (1.53 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 59 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU2) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 10.0, NumDevs = 4
Result = PASS
I set up a machine with no IB adapters and no Slurm installed, but loaded the HPC SDK container from NGC.
Initially, I had output like yours. After specifying the MPI bootstrap, I was able to get the expected output:
$ docker run --gpus all -it nvcr.io/nvidia/nvhpc:21.9-devel-cuda11.4-ubuntu20.04
Unable to find image 'nvcr.io/nvidia/nvhpc:21.9-devel-cuda11.4-ubuntu20.04' locally
21.9-devel-cuda11.4-ubuntu20.04: Pulling from nvidia/nvhpc
35807b77a593: Pull complete
aeeadb34b8bb: Pull complete
fd141d25b2de: Pull complete
fa267419451b: Pull complete
7b67f733c95c: Pull complete
64e5655a8a9c: Pull complete
00bc43ec6df4: Pull complete
e14d14601b52: Pull complete
9ed48a7e7dd0: Pull complete
e7067f225f43: Pull complete
Digest: sha256:d8c3f6ac1d2543c8453213d03c264476a125b63aa5af52f9f8f8583e67cdee10
Status: Downloaded newer image for nvcr.io/nvidia/nvhpc:21.9-devel-cuda11.4-ubuntu20.04
====================
== NVIDIA HPC SDK ==
====================
NVIDIA HPC SDK version 21.9
Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
root@fcdc66ae349b:/# nvidia-smi
Tue Aug 2 15:49:25 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:1B:00.0 Off | 0 |
| N/A 40C P0 57W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:1C:00.0 Off | Off |
| N/A 37C P0 57W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:3E:00.0 Off | Off |
| N/A 39C P0 56W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:3F:00.0 Off | Off |
| N/A 40C P0 59W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | Off |
| N/A 37C P0 57W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:8A:00.0 Off | Off |
| N/A 38C P0 60W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:B2:00.0 Off | Off |
| N/A 37C P0 57W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:B3:00.0 Off | Off |
| N/A 37C P0 56W / 300W | 0MiB / 32510MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
# cat t2.cu
#include <stdio.h>
#include <cuda.h>
#include <nvshmem.h>
#include <nvshmemx.h>
// Each PE writes its own PE number into the symmetric buffer on the next PE (a ring shift).
__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(destination, mype, peer);
}
int main(void) {
    int mype_node, msg;
    cudaStream_t stream;
    nvshmem_init();
    // One GPU per PE: bind this process to the device matching its on-node PE index.
    mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);
    cudaStreamCreate(&stream);
    // Allocate from the symmetric heap so every PE can address the buffer remotely.
    int *destination = (int *) nvshmem_malloc(sizeof(int));
    simple_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    printf("%d: received message %d\n", nvshmem_my_pe(), msg);
    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
# export NVSHMEM_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/comm_libs/nvshmem
# nvcc -rdc=true -arch=sm_70 -I$NVSHMEM_HOME/include t2.cu -o t2 -L$NVSHMEM_HOME/lib -lnvshmem -lcuda
# mpirun -n 4 --allow-run-as-root ./t2
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
0: received message 0
0: received message 0
0: received message 0
0: received message 0
# NVSHMEM_BOOTSTRAP="MPI" mpirun --allow-run-as-root -np 4 ./t2
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[fcdc66ae349b:00466] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[fcdc66ae349b:00468] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[fcdc66ae349b:00469] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[fcdc66ae349b:00467] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
0: received message 3
1: received message 0
2: received message 1
3: received message 2
#
The warning:
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
is due to not having InfiniBand set up properly. The remainder of the spew prior to the actual program output is there for the same reason (no IB, and I haven't gone through the process of telling MPI to stop checking for IB).
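As a hedged aside (these knobs depend on the exact Open MPI and NVSHMEM versions in the container): with Open MPI, the HCOLL messages can usually be silenced by excluding that collective component, and recent NVSHMEM releases can be told to skip the IB transport entirely:

$ export NVSHMEM_REMOTE_TRANSPORT=none    # NVSHMEM 2.2+ only; skips the ibrc transport
$ NVSHMEM_BOOTSTRAP="MPI" mpirun --allow-run-as-root --mca coll ^hcoll -np 4 ./t2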
Yes, it's possible to not use MPI. NVSHMEM requires a "bootstrap". If you read the document I indicated, it demonstrates the build/install/setup needed for the various bootstrap options: bootstrap via PMI/nvshmrun, bootstrap via MPI, and bootstrap via Slurm.
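For the MPI bootstrap specifically, NVSHMEM also documents an explicit initialization path: initialize MPI yourself and hand the communicator to NVSHMEM through nvshmemx_init_attr. A minimal sketch (compile and link against MPI in addition to the nvcc flags used above):

#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>
int main(int argc, char *argv[]) {
    // Bring up MPI first; NVSHMEM then bootstraps over the supplied communicator.
    MPI_Init(&argc, &argv);
    nvshmemx_init_attr_t attr;
    MPI_Comm comm = MPI_COMM_WORLD;
    attr.mpi_comm = &comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
    // ... set the device and do NVSHMEM work here, as in t2.cu ...
    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}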
$ NVSHMEM_BOOTSTRAP="MPI" mpirun --allow-run-as-root -np 4 ../../bin/test_nvshmem
There are 4 GPUs available
There are 4 GPUs available
There are 4 GPUs available
There are 4 GPUs available
npes = 4, mype = 0, mype_node = 0
npes = 4, mype = 1, mype_node = 1
npes = 4, mype = 2, mype_node = 2
npes = 4, mype = 3, mype_node = 3
2: received message 1
0: received message 3
3: received message 2
1: received message 0
Now I installed Hydra, and then got this:
$ export NVSHMEM_BOOTSTRAP="PMI"
$ export NVSHMEM_BOOTSTRAP_PMI="PMI"
$ nvshmrun ../../bin/test_nvshmem
There are 4 GPUs available
There are 4 GPUs available
There are 4 GPUs available
There are 4 GPUs available
npes = 4, mype = 0, mype_node = 0
npes = 4, mype = 1, mype_node = 1
npes = 4, mype = 2, mype_node = 2
npes = 4, mype = 3, mype_node = 3
0: received message 3
1: received message 0
2: received message 1
3: received message 2
It seems that even if I use nvshmrun, it still launches 4 processes instead of 1 single process, which is the same as what MPI does. Correct?
Yes, nvshmrun is a launcher, something like mpirun. The example code is designed to work that way. If you study the code, there is only one call to cudaSetDevice, and it is not in a loop. So each process (however you launch them) will only talk to one GPU.
NVSHMEM requires multiple processes, where each process is treated as a PE. So you cannot avoid multiple processes.
You can also check the bootstraps supported by srun by running: srun --mpi=list.
NVSHMEM comes with built-in support for MPI, PMI, PMI-2, and PMIx bootstraps. You can always choose one of the bootstraps from that list that NVSHMEM supports and specify it to NVSHMEM using the NVSHMEM_BOOTSTRAP environment variable. Something like this (say, for the PMI-2 bootstrap):
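A plausible invocation under Slurm, assuming its PMI-2 support is enabled (the exact value strings can vary by NVSHMEM version; the PMI flavor is selected with NVSHMEM_BOOTSTRAP_PMI, mirroring the PMI example earlier in the thread):

$ export NVSHMEM_BOOTSTRAP="PMI"
$ export NVSHMEM_BOOTSTRAP_PMI="PMI-2"
$ srun --mpi=pmi2 -n 4 ./t2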