It seems evident that NVSHMEM thinks there is only one PE. The rest of the output/behavior is consistent with that.
What is the configuration of your 4 GPUs? What system are they in? Can they all be in a peer relationship with each other? (deviceQuery can tell you this.) How exactly are you compiling and running the program?
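As an aside, the pairwise peer-access matrix that deviceQuery prints can also be queried programmatically. A minimal sketch using the standard CUDA runtime call cudaDeviceCanAccessPeer:

#include <stdio.h>
#include <cuda_runtime.h>
int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    // Check every ordered pair of GPUs for peer (P2P) access capability.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU%d -> GPU%d : %s\n", i, j, canAccess ? "Yes" : "No");
        }
    return 0;
}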
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 4 CUDA Capable device(s)
Device 0: "Tesla V100-SXM2-16GB"
CUDA Driver Version / Runtime Version 11.4 / 10.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16160 MBytes (16945512448 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1530 MHz (1.53 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 21 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Tesla V100-SXM2-16GB"
CUDA Driver Version / Runtime Version 11.4 / 10.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16160 MBytes (16945512448 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1530 MHz (1.53 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 22 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 2: "Tesla V100-SXM2-16GB"
CUDA Driver Version / Runtime Version 11.4 / 10.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16160 MBytes (16945512448 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1530 MHz (1.53 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 58 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 3: "Tesla V100-SXM2-16GB"
CUDA Driver Version / Runtime Version 11.4 / 10.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 16160 MBytes (16945512448 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1530 MHz (1.53 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 59 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU0) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU2) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU1) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU2) -> Tesla V100-SXM2-16GB (GPU3) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU0) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU1) : Yes
> Peer access from Tesla V100-SXM2-16GB (GPU3) -> Tesla V100-SXM2-16GB (GPU2) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 10.0, NumDevs = 4
Result = PASS
I set up a machine with no IB adapters and no Slurm installed, but loaded the HPC SDK container from NGC.
Initially, I had output like yours. After specifying the MPI bootstrap, I was able to get the expected output:
$ docker run --gpus all -it nvcr.io/nvidia/nvhpc:21.9-devel-cuda11.4-ubuntu20.04
Unable to find image 'nvcr.io/nvidia/nvhpc:21.9-devel-cuda11.4-ubuntu20.04' locally
21.9-devel-cuda11.4-ubuntu20.04: Pulling from nvidia/nvhpc
35807b77a593: Pull complete
aeeadb34b8bb: Pull complete
fd141d25b2de: Pull complete
fa267419451b: Pull complete
7b67f733c95c: Pull complete
64e5655a8a9c: Pull complete
00bc43ec6df4: Pull complete
e14d14601b52: Pull complete
9ed48a7e7dd0: Pull complete
e7067f225f43: Pull complete
Digest: sha256:d8c3f6ac1d2543c8453213d03c264476a125b63aa5af52f9f8f8583e67cdee10
Status: Downloaded newer image for nvcr.io/nvidia/nvhpc:21.9-devel-cuda11.4-ubuntu20.04
====================
== NVIDIA HPC SDK ==
====================
NVIDIA HPC SDK version 21.9
Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
root@fcdc66ae349b:/# nvidia-smi
Tue Aug 2 15:49:25 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:1B:00.0 Off | 0 |
| N/A 40C P0 57W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:1C:00.0 Off | Off |
| N/A 37C P0 57W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:3E:00.0 Off | Off |
| N/A 39C P0 56W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:3F:00.0 Off | Off |
| N/A 40C P0 59W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | Off |
| N/A 37C P0 57W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:8A:00.0 Off | Off |
| N/A 38C P0 60W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:B2:00.0 Off | Off |
| N/A 37C P0 57W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:B3:00.0 Off | Off |
| N/A 37C P0 56W / 300W | 0MiB / 32510MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
# cat t2.cu
#include <stdio.h>
#include <cuda.h>
#include <nvshmem.h>
#include <nvshmemx.h>
// Each PE writes its own PE number into the symmetric buffer on the next PE (a ring shift).
__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(destination, mype, peer);
}
int main(void) {
    int mype_node, msg;
    cudaStream_t stream;
    nvshmem_init();
    // One GPU per PE: bind this process to the device matching its on-node PE index.
    mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);
    cudaStreamCreate(&stream);
    // Allocate from the symmetric heap so every PE can address the buffer remotely.
    int *destination = (int *) nvshmem_malloc(sizeof(int));
    simple_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    printf("%d: received message %d\n", nvshmem_my_pe(), msg);
    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
# export NVSHMEM_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/comm_libs/nvshmem
# nvcc -rdc=true -arch=sm_70 -I$NVSHMEM_HOME/include t2.cu -o t2 -L$NVSHMEM_HOME/lib -lnvshmem -lcuda
# mpirun -n 4 --allow-run-as-root ./t2
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
0: received message 0
0: received message 0
0: received message 0
0: received message 0
# NVSHMEM_BOOTSTRAP="MPI" mpirun --allow-run-as-root -np 4 ./t2
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[fcdc66ae349b:00466] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[fcdc66ae349b:00468] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[fcdc66ae349b:00469] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[LOG_CAT_ML] Unable to get list of available IB devices (ibv_get_device_list failed)
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[fcdc66ae349b:00467] Error: ../../../../../ompi/mca/coll/hcoll/coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
0: received message 3
1: received message 0
2: received message 1
3: received message 2
#
The warning:
src/comm/transports/ibrc/ibrc.cpp:1533: NULL value get_device_list failed
is due to not having InfiniBand set up properly. The remainder of the spew prior to the actual program output is there for the same reason (no IB, and I haven't gone through the process of telling MPI to stop checking for IB).
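As a hedged aside (these knobs depend on the exact Open MPI and NVSHMEM versions in the container): with Open MPI, the HCOLL messages can usually be silenced by excluding that collective component, and recent NVSHMEM releases can be told to skip the IB transport entirely:

$ export NVSHMEM_REMOTE_TRANSPORT=none    # NVSHMEM 2.2+ only; skips the ibrc transport
$ NVSHMEM_BOOTSTRAP="MPI" mpirun --allow-run-as-root --mca coll ^hcoll -np 4 ./t2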
Yes, it's possible to not use MPI. NVSHMEM requires a "bootstrap". If you read the document I indicated, it demonstrates the build/install/setup needed for the various bootstrap options: bootstrap via PMI/nvshmrun, bootstrap via MPI, and bootstrap via Slurm.
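For the MPI bootstrap specifically, NVSHMEM also documents an explicit initialization path: initialize MPI yourself and hand the communicator to NVSHMEM through nvshmemx_init_attr. A minimal sketch (compile and link against MPI in addition to the nvcc flags used above):

#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>
int main(int argc, char *argv[]) {
    // Bring up MPI first; NVSHMEM then bootstraps over the supplied communicator.
    MPI_Init(&argc, &argv);
    nvshmemx_init_attr_t attr;
    MPI_Comm comm = MPI_COMM_WORLD;
    attr.mpi_comm = &comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
    // ... set the device and do NVSHMEM work here, as in t2.cu ...
    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}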
$ NVSHMEM_BOOTSTRAP="MPI" mpirun --allow-run-as-root -np 4 ../../bin/test_nvshmem
There are 4 GPUs available
There are 4 GPUs available
There are 4 GPUs available
There are 4 GPUs available
npes = 4, mype = 0, mype_node = 0
npes = 4, mype = 1, mype_node = 1
npes = 4, mype = 2, mype_node = 2
npes = 4, mype = 3, mype_node = 3
2: received message 1
0: received message 3
3: received message 2
1: received message 0
Now I installed Hydra, and then got this:
$ export NVSHMEM_BOOTSTRAP="PMI"
$ export NVSHMEM_BOOTSTRAP_PMI="PMI"
$ nvshmrun ../../bin/test_nvshmem
There are 4 GPUs available
There are 4 GPUs available
There are 4 GPUs available
There are 4 GPUs available
npes = 4, mype = 0, mype_node = 0
npes = 4, mype = 1, mype_node = 1
npes = 4, mype = 2, mype_node = 2
npes = 4, mype = 3, mype_node = 3
0: received message 3
1: received message 0
2: received message 1
3: received message 2
It seems that even if I use nvshmrun, it still launches 4 processes instead of 1 single process, which is the same as what MPI does. Correct?
Yes, nvshmrun is a launcher, something like mpirun. The example code is designed to work that way. If you study the code, there is only one call to cudaSetDevice, and it is not in a loop. So each process (however you launch them) will only talk to one GPU.
NVSHMEM requires multiple processes, where each process is treated as a PE. So you cannot avoid multiple processes.
You can also check the bootstraps supported by srun by running: srun --mpi=list.
NVSHMEM comes with built-in support for MPI, PMI, PMI-2, and PMIx bootstraps. You can always choose one of the bootstraps from that list that NVSHMEM supports and specify it to NVSHMEM using the NVSHMEM_BOOTSTRAP environment variable. Something like this (say, for the PMI-2 bootstrap):
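A plausible invocation under Slurm, assuming its PMI-2 support is enabled (the exact value strings can vary by NVSHMEM version; the PMI flavor is selected with NVSHMEM_BOOTSTRAP_PMI, mirroring the PMI example earlier in the thread):

$ export NVSHMEM_BOOTSTRAP="PMI"
$ export NVSHMEM_BOOTSTRAP_PMI="PMI-2"
$ srun --mpi=pmi2 -n 4 ./t2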