CUDA IPC - Virtual Memory API (cuMemImportFromShareableHandle CUDA_ERROR_INVALID_DEVICE) - CUDA 11.3

I am adapting the CUDA sample MemMapIPC into two very simple producer/consumer applications. The producer appears to work fine; however, my consumer application returns an “invalid device ordinal” error when I call cuMemImportFromShareableHandle. I am not sure why, as this call does not take a device ID parameter. My code is almost identical to the example code, and I verified that the file descriptor is passed through the IPC correctly, so my shareableHandles vector looks accurate. Any insight or suggestions for further debugging why cuMemImportFromShareableHandle would return CUDA_ERROR_INVALID_DEVICE = 101?

Producer: Creates, shares, and also accesses the device memory to put some data in it

  • cuMemCreate to allocate
  • cuMemExportToShareableHandle to create the fd
  • cuMemAddressReserve to reserve virtual address space
  • cuMemMap to map for access
  • cuMemSetAccess to set permissions
  • copies some host data into the memory
  • goes to sleep for 30 seconds

Consumer: Gets the fd via IPC, maps the memory, and accesses the data

  • cuMemAddressReserve to reserve virtual address space
  • cuMemMap to map for us
  • cuMemImportFromShareableHandle to turn the int fd back into CUDA IPC Handles
  • cuMemMap to map for access
  • runs kernel to print some data


Ubuntu 20.04
Thu Apr 22 20:09:06 2021
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
| 0 NVIDIA A100-PCI… On | 00000000:39:00.0 Off | 0 |
| N/A 32C P0 44W / 250W | 4MiB / 40536MiB | 0% Default |
| | | Disabled |
| 1 NVIDIA A100-PCI… On | 00000000:3C:00.0 Off | 0 |
| N/A 32C P0 43W / 250W | 4MiB / 40536MiB | 0% Default |
| | | Disabled |

Attached is a complete example demonstrating the issue I am observing.

  • simple-mmap-prod-con.tar.gz
  • md5sum: 3a29415ec0696b88d29b48b776d25f2b

simple-mmap-prod-con.tar.gz (12.0 KB)

After spending some time with the MemMapIPC example, I realize it may not fit my use case.

The file descriptors generated by cuMemExportToShareableHandle into ShareableHandle are local to your process space, and can be viewed under /proc/pid/fd linked to /dev/nvidiactl
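As a rough illustration of that observation, a process can resolve one of its own descriptors through /proc. This is a minimal, Linux-only sketch; the helper name fd_target is hypothetical and is not part of the MemMapIPC sample or of CUDA. For a descriptor exported by cuMemExportToShareableHandle, the resolved target would be expected to match the /dev/nvidiactl link described above.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: resolve what descriptor `fd` of the calling process
 * refers to by reading the /proc/self/fd/<fd> symlink (Linux only).
 * Writes a NUL-terminated target into `buf` and returns its length,
 * or returns -1 on error. */
ssize_t fd_target(int fd, char *buf, size_t buflen)
{
    char link[64];

    assert(buf != NULL && buflen > 0);
    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);

    ssize_t n = readlink(link, buf, buflen - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';  /* readlink does not NUL-terminate the result */
    return n;
}
```

The same information is visible from a shell by listing /proc/pid/fd for the producer's PID, which is also why another process would need that PID to find the descriptor.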

By themselves, they are not shareable across process boundaries: Process B would need to know Process A's PID and have access to the /proc/pid-a/ filesystem.

The example uses Unix sendmsg / recvmsg with SCM_RIGHTS control messages (CMSG) to recreate the file descriptors in process B, much as the dup system call would.
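For reference, the general SCM_RIGHTS technique can be sketched as below. This is not the sample's own helper code; the function names send_fd and recv_fd are hypothetical, and the sketch assumes an already connected Unix-domain socket between the two processes.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: send descriptor `fd` over the connected Unix-domain
 * socket `sock` as SCM_RIGHTS ancillary data. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char dummy = 'F';  /* one byte of ordinary data carries the control message */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {            /* the union keeps the buffer aligned for struct cmsghdr */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof ctrl);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    assert(cmsg != NULL);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor sent by send_fd. The kernel
 * installs a new descriptor in the receiving process that refers to the same
 * open file. Returns the new descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;
    int fd;

    memset(&ctrl, 0, sizeof ctrl);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS || cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The integer returned by recv_fd is generally a different number from the one the sender passed, but it refers to the same underlying open file. In the MemMapIPC flow, it is this received descriptor that the consumer hands to cuMemImportFromShareableHandle.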

I'm not sure what magic happens in the device driver to map these back into CUDA device memory handles.

Since my use case involves using host shared memory as a means to share IPC handles between a producer and many dockerized consumer processes that are unaware of the producer process, it looks like this example is not a good starting point for me.

I have an existing implementation based on cudaIpcGetMemHandle, and it seems I am sticking with that for now… unless there is a way to use those handles with the new API functions (e.g. cuMemAddressReserve, cuMemMap)…

With respect to that, is a CUmemGenericAllocationHandle the same as the cudaIpcMemHandle_t returned by cudaIpcGetMemHandle?