How can I debug 'mapping of buffer object failed', which only happens on some computers?

I am compiling + running a custom allreduce implementation using MPI.
The code is here: gpu_kernels/allreduce/csrc/reference_allreduce at main · vedantroy/gpu_kernels · GitHub, but what’s important is:

I compile the code with nvcc:

nvcc -I/usr/include -I/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -I/usr/lib/x86_64-linux-gnu/openmpi/include -L/usr/lib/x86_64-linux-gnu/openmpi/lib -lmpi -L/usr/lib/x86_64-linux-gnu -lnccl -gencode=arch=$arch,code=$code "$script_dir/" -o fastallreduce_test.bin

And I run it with the command:

mpirun  --allow-run-as-root -np 2 csrc/reference_allreduce/fastallreduce_test.bin

This command works on a conventional machine. However, when I run it on a serverless provider like Modal:

  • MPI fails with “All nodes which are allocated for this job are already filled.”
  • So then, I add a hostfile which fixes the issue

But, then I get the error:

WARNING: The default btl_vader_single_copy_mechanism CMA is
not available due to different user namespaces.

and the error ‘mapping of buffer object failed’ from cudaIpcOpenMemHandle.
Any thoughts on what I could do to work around this? Maybe it’s not possible.

I don’t have a holistic answer for you. If I were working on this, the first thing I would suspect is that your host-based IPC is failing, either because you are not properly checking for errors, or because it is failing silently.

cudaIpcOpenMemHandle expects to be given a memory handle that was created in another process. First make sure you are not trying to open a mem handle that was created in the same process that you are trying to open it from (that is not allowed). Then print out the numerical value of the mem handle in the process that created the mem handle and in the process that is trying to run cudaIpcOpenMemHandle on it.

If the numerical values are the same, then there is some deeper problem that I cannot fathom.

If the numerical values are not the same, then your method for communicating the handle from one process to the other is failing, and I would focus my debug effort on that.

I have not inspected your code, nor do I know anything about it, so this advice may be misguided or off-base.

You could also ask modal about it, if they have some sort of support mechanism.

Another thought that occurs to me is that for CUDA IPC to work, I’m fairly certain that the GPUs hosting the buffers must be visible to all interested parties.

If you or your service is launching MPI ranks with a preamble that includes e.g. CUDA_VISIBLE_DEVICES="...", you may want to check whether that is a factor.

Also, if you or your service is running this in a container, there are container IPC settings that may be important. A Google search will turn up some examples.
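For example, with Docker, CUDA IPC between processes generally requires sharing the host's IPC and PID namespaces, and Open MPI's vader CMA warning can be avoided by disabling that single-copy mechanism. These flags exist in Docker and Open MPI, but whether a managed provider like Modal lets you set them is an assumption you would have to verify:

```shell
# Share the host's IPC and PID namespaces so CUDA IPC handles
# can resolve across processes in the container:
docker run --gpus all --ipc=host --pid=host ...

# Tell Open MPI's vader BTL not to use CMA, which fails across
# different user namespaces (the warning in your output):
mpirun --mca btl_vader_single_copy_mechanism none --allow-run-as-root -np 2 ...
```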

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.