Segfault when allocating symmetric memory in NVSHMEM with UCX support

pazkal · July 27, 2021, 4:38pm

Hi there,

I am trying to get NVSHMEM to work with UCX, but as soon as I enable it by setting NVSHMEM_REMOTE_TRANSPORT=ucx, I get a segfault when allocating symmetric memory. Please see the minimal example below:

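#include <cuda_runtime.h>  // cudaSetDevice
#include <nvshmem.h>
#include <nvshmemx.h>      // NVSHMEMX_* extensions

int main(int argc, char *argv[]) {

    nvshmem_init();

    // Bind this PE to the GPU matching its rank within the node.
    int nodeRank = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(nodeRank);

    // Allocate and release a small symmetric buffer; the segfault occurs here.
    auto *slice = (double *) nvshmem_malloc(1024 * sizeof(double));
    nvshmem_free(slice);

    nvshmem_finalize();
    return 0;
}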

Executing this code even with a single process and a single A100 GPU returns the following error:

Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

I have built NVSHMEM with UCX 1.11.0 (I also tested 1.10.1, which produces the same error). UCX is configured with both --enable-mt and --with-dm, as noted in the installation guide.

Am I missing a configuration flag or am I doing something wrong? I’d be grateful if someone who successfully uses NVSHMEM with UCX in this way could give me a hint or point me in the right direction.

Thanks in advance!

Best regards,
Pascal

pazkal · July 29, 2021, 3:36pm

I managed to track the problem down to the GDRCopy library installed on the cluster where I am trying to run my applications. As far as I can see, the missing gdrdrv kernel module is the culprit. I can’t fully test that hypothesis since I can’t install kernel modules on the cluster myself, but I’ll post an answer as soon as the cluster administrators have installed the module.
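
In the meantime, a check along the following lines can show without root access whether the module is loaded. This is only a minimal sketch: it assumes a Linux system where loaded kernel modules are listed in /proc/modules with the module name as the first field, and the helper names module_listed and gdrdrv_loaded are hypothetical, not part of GDRCopy or NVSHMEM.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns true if `module` is the first whitespace-separated field of any
// line in a /proc/modules-style listing.
bool module_listed(std::istream &listing, const std::string &module) {
    std::string line;
    while (std::getline(listing, line)) {
        std::istringstream fields(line);
        std::string name;
        if (fields >> name && name == module) {
            return true;
        }
    }
    return false;
}

// Hypothetical convenience wrapper: checks the running kernel for gdrdrv.
// Assumption: /proc/modules is readable; if it cannot be opened, this
// reports false rather than failing.
bool gdrdrv_loaded() {
    std::ifstream modules("/proc/modules");
    return modules && module_listed(modules, "gdrdrv");
}
```

Calling gdrdrv_loaded() before nvshmem_init() and printing a warning when it returns false would at least make the failure mode visible. It only confirms whether the module is loaded, not that this is the cause of the segfault.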
