I’m encountering an odd seg fault at program exit when I launch a cuBLAS kernel after NVSHMEM has been initialized.
- The cuBLAS kernels complete successfully without throwing any errors, returning the correct result.
- NVSHMEM finalizes successfully without throwing any errors.
- On program end, a seg fault occurs.
- If I do not initialize NVSHMEM, everything completes successfully.
- NVSHMEM is otherwise working normally on this system, and I am observing the bandwidth I expect over NVLink.
I am building with NVSHMEM 3.2.5-1, built from source against CUDA 12.5.82. I have attached a minimal reproducer with a CMake build as a tarball below. I am running on a DGX with 8 H100s, using only NVLink (no cross-node interconnect is available).
Here is the simplified example that I’ve attached:
// Initialize NVSHMEM and bind this PE to the GPU matching its on-node rank
nvshmem_init();
int node_pe = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
CUDA_CHECK(cudaSetDevice(node_pe));

// Small column-major GEMM: C = alpha * A * B + beta * C, with A (m x k) and B (k x n)
using T = float;
std::size_t m = 16;
std::size_t n = 16;
std::size_t k = 16;
std::vector<T> a(m * k, 1);
std::vector<T> b(k * n, 2);
std::vector<T> c(m * n, 0);

// Plain cudaMalloc allocations (not nvshmem_malloc); NVSHMEM is never used for these buffers
T* d_a;
T* d_b;
T* d_c;
CUDA_CHECK(cudaMalloc((void**) &d_a, sizeof(T) * m * k));
CUDA_CHECK(cudaMalloc((void**) &d_b, sizeof(T) * k * n));
CUDA_CHECK(cudaMalloc((void**) &d_c, sizeof(T) * m * n));
CUDA_CHECK(cudaMemcpy(d_a, a.data(), sizeof(T) * m * k, cudaMemcpyDefault));
CUDA_CHECK(cudaMemcpy(d_b, b.data(), sizeof(T) * k * n, cudaMemcpyDefault));
CUDA_CHECK(cudaMemcpy(d_c, c.data(), sizeof(T) * m * n, cudaMemcpyDefault));

// Run the GEMM through cuBLAS and wait for it to finish
cublasHandle_t handle;
CUBLAS_CHECK(cublasCreate(&handle));
const T alpha = 1;
const T beta = 1;
CUBLAS_CHECK(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, d_a, m,
                         d_b, k, &beta, d_c, m));
CUDA_CHECK(cudaDeviceSynchronize());

// Tear down cuBLAS and the device buffers before finalizing NVSHMEM
CUBLAS_CHECK(cublasDestroy(handle));
CUDA_CHECK(cudaFree(d_a));
CUDA_CHECK(cudaFree(d_b));
CUDA_CHECK(cudaFree(d_c));
CUDA_CHECK(cudaDeviceSynchronize());
nvshmem_barrier_all();
nvshmem_finalize();
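The CUDA_CHECK and CUBLAS_CHECK wrappers are defined in the attached source; as a rough sketch of what they are assumed to do (not the exact definitions from the tarball), something like:

// Sketch only; assumes <cstdio> and <cstdlib> in addition to the CUDA runtime and cuBLAS headers
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            std::abort();                                                   \
        }                                                                   \
    } while (0)

#define CUBLAS_CHECK(call)                                                  \
    do {                                                                    \
        cublasStatus_t st_ = (call);                                        \
        if (st_ != CUBLAS_STATUS_SUCCESS) {                                 \
            std::fprintf(stderr, "cuBLAS error %d at %s:%d\n", (int) st_,   \
                         __FILE__, __LINE__);                               \
            std::abort();                                                   \
        }                                                                   \
    } while (0)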
This produces the following seg fault, which occurs after nvshmem_finalize() returns, as the program exits:
xiii@slothius:~/src/ai/nvshmem-cublas/build/src> mpirun -n 1 ./nvshmem_example
[slothius:185973:0:185973] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x14fd567be720)
Segmentation fault (core dumped)
If nvshmem_init() and nvshmem_finalize() are commented out, no error occurs.
I am building and running with the following environment variables set:
export NVSHMEM_REMOTE_TRANSPORT=none
export NVSHMEM_BOOTSTRAP=MPI
export NVSHMEM_HOME=/home/slothius/pkg/nvshmem_3.2.5-1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/slothius/pkg/nvshmem_3.2.5-1/lib
export CMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH:/home/slothius/pkg/nvshmem_3.2.5-1
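For context, the reproducer relies on NVSHMEM's MPI bootstrap plugin (NVSHMEM_BOOTSTRAP=MPI) and a plain nvshmem_init() call, rather than passing an MPI communicator explicitly. A sketch of the explicit-initialization variant, following the documented NVSHMEMX_INIT_WITH_MPI_COMM pattern (this is not what the attached code does), would look roughly like:

// Sketch only: explicit MPI-based NVSHMEM initialization
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm mpi_comm = MPI_COMM_WORLD;
    nvshmemx_init_attr_t attr;
    attr.mpi_comm = &mpi_comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    // ... same cuBLAS GEMM body as in the excerpt above ...

    nvshmem_barrier_all();
    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}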
I also ran the attached minimal reproducer under compute-sanitizer. I observed many warnings about calls to cuPointerGetAttribute with host pointers inside of MPI, which I assume come from NVSHMEM interacting with MPI. From some quick Googling this appears to be a known issue (“[these] are benign errors” and “this is very annoying if we want to run cuda-memcheck and somewhere in the code there is a call to this function”).
nvshmem-cublas.tar.gz (2.1 KB)