BUG: calling cudaFree(0) before nvshmem_init() makes nvshmem_barrier_all() fail

The problem

When I run my NVSHMEM program, I hit a failure in nvshmem_barrier_all().

I found that if I call cudaFree(0) before nvshmem_init(), nvshmem_barrier_all() fails. Calling other CUDA runtime APIs such as cudaMalloc or cudaFuncSetAttribute before nvshmem_init() triggers the same problem.

Moving cudaFree(0) to right after nvshmem_init() solves the problem.
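
For example, this reordering works (a minimal sketch using the same calls as the full sample code below):

nvshmem_init();
int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
cudaSetDevice(mype_node);
cudaFree(0);            // moved to after nvshmem_init(): the barrier no longer fails
nvshmem_barrier_all();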

Details

nvshmem version: 3.2.5

I built NVSHMEM with:

export NVSHMEM_IBGDA_SUPPORT=0
export NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=0
export NVSHMEM_IBDEVX_SUPPORT=0
export NVSHMEM_IBRC_SUPPORT=1
export NVSHMEM_LIBFABRIC_SUPPORT=0
export NVSHMEM_MPI_SUPPORT=1
export NVSHMEM_USE_GDRCOPY=0
export NVSHMEM_ENABLE_ALL_DEVICE_INLINING=1

    cmake .. \
        -DCMAKE_EXPORT_COMPILE_COMMANDS=1 \
        -DCMAKE_CUDA_ARCHITECTURES=90 \
        -DNVSHMEM_BUILD_TESTS=ON \
        -DNVSHMEM_BUILD_EXAMPLES=OFF \
        -DNVSHMEM_BUILD_PACKAGES=OFF

The sample code:


#include <cstdio>  // fprintf in CUDA_CHECK
#include <cstdlib> // exit
#include <cuda_runtime.h>

#include <nvshmem.h>
#include <nvshmemx.h>

#undef CUDA_CHECK
#define CUDA_CHECK(stmt)                                                       \
  do {                                                                         \
    cudaError_t result = (stmt);                                               \
    if (cudaSuccess != result) {                                               \
      fprintf(stderr, "[%s:%d] cuda failed with %s \n", __FILE__, __LINE__,    \
              cudaGetErrorString(result));                                     \
      exit(-1);                                                                \
    }                                                                          \
  } while (0)

using namespace std;

int main(int argc, char *argv[]) {
  // ** Calling cudaFree(0) here, before nvshmem_init(), makes nvshmem_barrier_all()
  // fail below. Comment out this line and the program runs fine. **
  CUDA_CHECK(cudaFree(0));

  int mype, npes, mype_node;
  fprintf(stderr, "nvshmem_init...\n");
  nvshmem_init();
  fprintf(stderr, "nvshmem_init done...\n");

  mype = nvshmem_my_pe();
  npes = nvshmem_n_pes();
  mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
  CUDA_CHECK(cudaSetDevice(mype_node));
  nvshmem_barrier_all();
  fprintf(stderr, "nvshmem_barrier_all done...\n");
  CUDA_CHECK(cudaDeviceSynchronize());

  nvshmem_finalize();
  return 0;
}

Compiled with:

NVSHMEM_HOME=/path/to/nvshmem
nvcc -std=c++17 -O3 -arch=sm_90 --compiler-options="-O2 -pipe -march=native -Wall -fopenmp" -Xcompiler -rdynamic --generate-line-info  -I${NVSHMEM_HOME}/include -L${NVSHMEM_HOME}/lib -lnvshmem -o main-nvshmem main-nvshmem.cu -L/opt/cuda/lib64 -lcublas -lnvidia-ml

When I run with this command:

NVSHMEM_DEBUG=INFO NVSHMEM_DEBUG_SUBSYS=ALL NVSHMEM_DISABLE_CUDA_VMM=1 nvshmrun -ppn 2 -n 2 main-nvshmem

I got this error:

When I run with this command instead:

NVSHMEM_DEBUG=INFO NVSHMEM_DEBUG_SUBSYS=ALL NVSHMEM_DISABLE_CUDA_VMM=0 nvshmrun -ppn 2 -n 2 main-nvshmem

it fails in nvshmem_init() (this is on an H800).

Is this the reason?

cudaFree(0) sets the device to 0. All PEs are getting assigned to device 0 before nvshmem_init() is called, which is not your intention. Therefore, the solution is either to not call cudaFree(0) before nvshmem_init(), or to explicitly set the device using cudaSetDevice() before calling nvshmem_init().
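
In code, that suggestion looks something like this (a sketch; the local-rank environment variable name is just an example and depends on your launcher):

// Bind each PE to its own GPU before the first CUDA runtime call, so the
// implicit context creation lands on the right device instead of device 0.
const char *lr = getenv("MPI_LOCALRANKID"); // example name; varies by launcher
int devID = lr ? atoi(lr) : 0;
cudaSetDevice(devID);
cudaFree(0);      // now initializes a context on devID rather than device 0
nvshmem_init();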

I couldn't find anything like “call cudaSetDevice first, then call nvshmem_init” in the docs. Did I miss something?

And if cudaSetDevice is necessary, maybe it should reside in the nvshmem_init implementation, so the user doesn't have to add it manually?

Or a WARNING message would help a lot. It took me so long to find the problem.

This is indeed a tricky interaction between the CUDA programming model and the comms library. In the call to nvshmem_init we do check whether the user has assigned a CUDA device to the process. If they haven’t, we delay the initialization until the first nvshmem_malloc call so that you can use the node PE ID as a convenience when assigning the CUDA device (as shown in the sample code you shared above). If the device ID still isn’t assigned at the call to nvshmem_malloc, then we should report an error.
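
Concretely, that deferred flow looks something like this (a sketch based on the description above, using only the calls already shown in this thread):

// No CUDA device assigned yet, so nvshmem_init defers device
// initialization until the first nvshmem_malloc call.
nvshmem_init();
int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
cudaSetDevice(mype_node);         // assign the GPU using the node PE ID
void *buf = nvshmem_malloc(1024); // device initialization completes here;
                                  // an error is reported if no device is set by now
nvshmem_free(buf);
nvshmem_finalize();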

As @alanger mentioned, other CUDA calls can cause CUDA to set the device for the process, and that can result in processes sharing GPUs. We do warn about that (MPG = Multiple Processes per GPU) since it limits NVSHMEM functionality; however, we also support MPG mode, as many users find it helpful for development and debugging.

Regarding setting the ID automatically, we could consider it as a convenience mode for nvshmemx_init_attr. Since the application will be running CUDA kernels, we’ve found that most users also prefer to explicitly set the CUDA device.

Thanks for your reply.

So I don’t have to call cudaSetDevice before nvshmemx_init_attr; NVSHMEM will take care of that for me, right?

I think what @jdinan is trying to say is that we can consider this as a feature request for nvshmemx_init_attr in a future release. It doesn’t exist today in NVSHMEM 3.2.5.
Users of NVSHMEM are currently required to do one of the following:

(a) Either set the CUDA device explicitly before nvshmem_init, as shown below:

// assuming devID is unique per PE on the node in your use case,
// e.g. devID = myrank % num_ranks_per_node;
cudaSetDevice(devID);
nvshmem_init();
....
nvshmem_finalize();

(b) Or set the device after nvshmem_init, but before nvshmem_malloc, as described here.
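
In the same style as (a), a sketch of option (b), using the node PE ID as in the sample code above:

nvshmem_init();
mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
cudaSetDevice(mype_node); // set the device before the first allocation
buf = nvshmem_malloc(nbytes); // device initialization completes by this point
....
nvshmem_finalize();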