Segfault when allocating symmetric memory in NVSHMEM with UCX support

Hi there,

I am trying to get NVSHMEM to work with UCX, but as soon as I enable it by setting NVSHMEM_REMOTE_TRANSPORT=ucx, the program segfaults when allocating symmetric memory. Please see the minimal example below:

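#include <cuda_runtime.h>  // cudaSetDevice
#include <nvshmem.h>
#include <nvshmemx.h>      // NVSHMEMX_TEAM_NODE

int main(int argc, char *argv[]) {
    nvshmem_init();

    // Use the PE's rank within its node to select the GPU.
    int nodeRank = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(nodeRank);

    // Allocate and free a small symmetric buffer; the segfault occurs here.
    auto *slice = (double *) nvshmem_malloc(1024 * sizeof(double));
    nvshmem_free(slice);

    nvshmem_finalize();
    return 0;
}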

Running this code, even with a single process on a single A100 GPU, produces the following error:

Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

I built NVSHMEM against UCX 1.11.0 (I also tested 1.10.1, which gives the same error). UCX is configured with both --enable-mt and --with-dm, as noted in the installation guide.

Am I missing a configuration flag or am I doing something wrong? I’d be grateful if someone who successfully uses NVSHMEM with UCX in this way could give me a hint or point me in the right direction.

Thanks in advance!

Best regards,
Pascal

I managed to track the problem down to the GDRCopy library installed on the cluster where I run my applications. As far as I can tell, the gdrdrv kernel module is missing, and that seems to be the culprit. I can’t test that hypothesis myself, since I can’t install kernel modules on the cluster, but I’ll post an answer as soon as the cluster administrators have installed the module.
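
For anyone who wants to check the same thing on their own system, here is a rough sketch of how one might test whether gdrdrv is loaded without root access. It assumes a Linux node where /proc/modules is readable; the helper names module_listed and gdrdrv_loaded are made up for this example and are not part of GDRCopy, UCX, or NVSHMEM. It only tells you whether the module is listed, not whether GDRCopy itself works.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `module` is the first field of any
// line in a /proc/modules-style listing (one module per line).
bool module_listed(std::istream &listing, const std::string &module) {
    std::string line;
    while (std::getline(listing, line)) {
        std::istringstream fields(line);
        std::string name;
        // Compare the whole first field so that a longer module name that
        // merely starts with "gdrdrv" does not count as a match.
        if ((fields >> name) && name == module) {
            return true;
        }
    }
    return false;
}

// Hypothetical wrapper for the running system. Assumes Linux; returns false
// if /proc/modules cannot be opened.
bool gdrdrv_loaded() {
    std::ifstream modules("/proc/modules");
    return modules && module_listed(modules, "gdrdrv");
}
```

Calling gdrdrv_loaded() on a compute node before nvshmem_init() and printing the result should be enough to confirm or rule out the missing module.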
