Seg fault on program end when using NVSHMEM and cuBLAS

I’m encountering an odd seg fault on program end when I launch a cuBLAS kernel after NVSHMEM has been initialized.

  • The cuBLAS calls complete successfully, report no errors, and return the correct result.
  • NVSHMEM finalizes successfully without reporting any errors.
  • On program end, a seg fault occurs.
  • If I do not initialize NVSHMEM, everything completes successfully.
  • NVSHMEM is otherwise working normally on this system, and I am observing the bandwidth I expect over NVLink.

I am building with NVSHMEM 3.2.5-1, built from source with CUDA 12.5.82. I have attached a minimal reproducer with a CMake build as a tarball below. I am running on a DGX with 8 H100s and am using only NVLink (no cross-node interconnect is available).
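
The attached tarball has the actual CMake files; roughly speaking, the build corresponds to a compile line like the one below (main.cu stands in for the actual source file name; library names assume the NVSHMEM 3.x host/device library split, and device-side NVSHMEM calls would additionally need -rdc=true and -lnvshmem_device):

nvcc -ccbin mpicxx main.cu \
    -I${NVSHMEM_HOME}/include -L${NVSHMEM_HOME}/lib \
    -lnvshmem_host -lcublas -o nvshmem_example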

Here is the simplified example that I’ve attached:

  nvshmem_init();

  int node_pe = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
  CUDA_CHECK(cudaSetDevice(node_pe));

  using T = float;

  std::size_t m = 16;
  std::size_t n = 16;
  std::size_t k = 16;

  std::vector<T> a(m * k, 1);
  std::vector<T> b(k * n, 2);
  std::vector<T> c(m * n, 0);

  T* d_a;
  T* d_b;
  T* d_c;

  CUDA_CHECK(cudaMalloc((void**) &d_a, sizeof(T) * m * k));
  CUDA_CHECK(cudaMalloc((void**) &d_b, sizeof(T) * k * n));
  CUDA_CHECK(cudaMalloc((void**) &d_c, sizeof(T) * m * n));

  CUDA_CHECK(cudaMemcpy(d_a, a.data(), sizeof(T) * m * k, cudaMemcpyDefault));
  CUDA_CHECK(cudaMemcpy(d_b, b.data(), sizeof(T) * k * n, cudaMemcpyDefault));
  CUDA_CHECK(cudaMemcpy(d_c, c.data(), sizeof(T) * m * n, cudaMemcpyDefault));

  cublasHandle_t handle;

  CUBLAS_CHECK(cublasCreate(&handle));

  const T alpha = 1;
  const T beta = 1;

  CUBLAS_CHECK(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                           d_a, m, d_b, k, &beta, d_c, m));

  CUDA_CHECK(cudaDeviceSynchronize());

  CUBLAS_CHECK(cublasDestroy(handle));

  CUDA_CHECK(cudaFree(d_a));
  CUDA_CHECK(cudaFree(d_b));
  CUDA_CHECK(cudaFree(d_c));

  CUDA_CHECK(cudaDeviceSynchronize());
  nvshmem_barrier_all();
  nvshmem_finalize();
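
CUDA_CHECK and CUBLAS_CHECK are not library macros; they are the usual status-checking wrappers, roughly:

  #define CUDA_CHECK(call)                                            \
    do {                                                              \
      cudaError_t err_ = (call);                                      \
      if (err_ != cudaSuccess) {                                      \
        std::fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                     cudaGetErrorString(err_), __FILE__, __LINE__);   \
        std::exit(EXIT_FAILURE);                                      \
      }                                                               \
    } while (0)

  #define CUBLAS_CHECK(call)                                          \
    do {                                                              \
      cublasStatus_t stat_ = (call);                                  \
      if (stat_ != CUBLAS_STATUS_SUCCESS) {                           \
        std::fprintf(stderr, "cuBLAS error %d at %s:%d\n",            \
                     static_cast<int>(stat_), __FILE__, __LINE__);    \
        std::exit(EXIT_FAILURE);                                      \
      }                                                               \
    } while (0)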

This produces the following seg fault, which occurs after nvshmem_finalize() at program exit:

xiii@slothius:~/src/ai/nvshmem-cublas/build/src> mpirun -n 1 ./nvshmem_example
[slothius:185973:0:185973] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x14fd567be720)
Segmentation fault (core dumped)

If nvshmem_init() and nvshmem_finalize() are commented out, no error occurs.

I am building and running with the following environment variables set:

export NVSHMEM_REMOTE_TRANSPORT=none
export NVSHMEM_BOOTSTRAP=MPI
export NVSHMEM_HOME=/home/slothius/pkg/nvshmem_3.2.5-1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/slothius/pkg/nvshmem_3.2.5-1/lib
export CMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH:/home/slothius/pkg/nvshmem_3.2.5-1

I ran the minimal reproducer above under compute-sanitizer. I observed a large number of warnings about cuPointerGetAttribute being called on host pointers inside MPI. I assume this comes from NVSHMEM interacting with MPI. From some quick Googling this appears to be a known issue (“[these] are benign errors” and “this is very annoying if we want to run cuda-memcheck and somewhere in the code there is a call to this function”).
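
For reference, the sanitizer run was along these lines, with compute-sanitizer wrapping the application under mpirun:

mpirun -n 1 compute-sanitizer ./nvshmem_example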

nvshmem-cublas.tar.gz (2.1 KB)

This issue seems to be related to using an MPI launcher:

  • If I set NVSHMEM_BOOTSTRAP=pmi, there’s no seg fault (but I don’t have a proper PMI launcher on this system, so this doesn’t work for my real use case)
  • If I use an Open MPI build that’s not CUDA-aware, there’s no seg fault.
  • If I use nvshmemx_init_attr to initialize NVSHMEM alongside MPI, there’s no seg fault (see the sketch after this list).
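
For completeness, the nvshmemx_init_attr workaround follows the MPI-communicator initialization pattern from the NVSHMEM documentation, roughly (a sketch, with the GEMM test body elided):

  MPI_Init(&argc, &argv);

  // Initialize NVSHMEM on top of an existing MPI communicator instead of
  // letting it bootstrap itself via NVSHMEM_BOOTSTRAP=MPI.
  nvshmemx_init_attr_t attr;
  MPI_Comm comm = MPI_COMM_WORLD;
  attr.mpi_comm = &comm;
  nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

  // ... same cuBLAS GEMM test as above ...

  nvshmem_finalize();
  MPI_Finalize();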

I will probably work around this for now by initializing NVSHMEM through nvshmemx_init_attr whenever MPI is available. However, this does seem to be some kind of bug, so I would appreciate an update if there are any other workarounds or fixes.

Thanks for reporting this issue. I have filed an internal defect to track it. Given that you have a workaround for now, you should be unblocked for your usage of NVSHMEM.