Hi there,
I am trying to get NVSHMEM to work with UCX, but as soon as I enable it by setting NVSHMEM_REMOTE_TRANSPORT=ucx, the program segfaults when it tries to allocate symmetric memory. Please see the minimal example code below:
#include <nvshmem.h>

int main(int argc, char *argv[]) {
    nvshmem_init();
    int nodeRank = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(nodeRank);
    auto *slice = (double *) nvshmem_malloc(1024 * sizeof(double));
    nvshmem_free(slice);
    nvshmem_finalize();
    return 0;
}
Executing this code even with a single process and a single A100 GPU returns the following error:
Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
I have built NVSHMEM with UCX 1.11.0 (I also tested 1.10.1, which produces the same error), and UCX is configured with both --enable-mt and --with-dm, as noted in the installation guide.
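One thing that may be worth ruling out first is whether the transport setting actually reaches the process, since some launchers do not forward environment variables. The following is only a minimal sketch, independent of NVSHMEM and UCX; the helper name `remote_transport_setting` is hypothetical and not part of any library:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not an NVSHMEM API): returns the value of
// NVSHMEM_REMOTE_TRANSPORT as seen by the current process, or
// "(unset)" if the variable is not present in the environment.
std::string remote_transport_setting() {
    const char *value = std::getenv("NVSHMEM_REMOTE_TRANSPORT");
    return value ? std::string(value) : std::string("(unset)");
}
```

Printing the result of this function at the top of main(), before nvshmem_init(), would show which value each process sees. It does not explain the segfault by itself; it only confirms that the UCX transport was really requested in the failing run.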
Am I missing a configuration flag or am I doing something wrong? I’d be grateful if someone who successfully uses NVSHMEM with UCX in this way could give me a hint or point me in the right direction.
Thanks in advance!
Best regards,
Pascal