Hello!
Currently, Jetson does not support CUDA IPC, which torch.multiprocessing requires for sharing CUDA tensors between processes. This problem was raised multiple times on this forum (1, 2, 3, 4); however, no clear solutions, fixes, or workarounds were posted. After struggling with this problem for some time I found a way around it, and I thought I would share it here in case others find it useful.
Working example:
Install torch and cuda-python, then run the following script: it creates a shared memory slot which can be accessed from different processes.
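The original script is not reproduced in this excerpt, but the allocation-and-export side can be sketched roughly as follows with cuda-python. This is a hedged sketch, not the exact script from the post: the names `create_shareable_allocation` and `round_up` are my own, a CUDA context is assumed to already exist, and driver error codes are not checked for brevity.

```python
def round_up(size: int, granularity: int) -> int:
    # cuMemCreate requires the size to be a multiple of the allocation
    # granularity, so round the requested size up to it.
    return ((size + granularity - 1) // granularity) * granularity


def create_shareable_allocation(nbytes: int, device: int = 0):
    """Sketch: allocate device memory exportable as a POSIX file descriptor.

    Hypothetical helper, not the author's script. Assumes cuda-python is
    installed and a CUDA context already exists on `device`; error codes
    returned by the driver calls are ignored for brevity.
    """
    from cuda import cuda  # deferred so round_up() works without a GPU

    # Describe a pinned device allocation that can be exported as an fd.
    prop = cuda.CUmemAllocationProp()
    prop.type = cuda.CUmemAllocationType.CU_MEM_ALLOCATION_TYPE_PINNED
    prop.location.type = cuda.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
    prop.location.id = device
    prop.requestedHandleTypes = (
        cuda.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR
    )

    # Sizes must be rounded up to the allocation granularity.
    err, gran = cuda.cuMemGetAllocationGranularity(
        prop,
        cuda.CUmemAllocationGranularity_flags.CU_MEM_ALLOC_GRANULARITY_MINIMUM,
    )
    size = round_up(nbytes, gran)

    # 1) physical allocation, 2) virtual address reservation,
    # 3) map physical memory into the reservation, 4) set access rights.
    err, handle = cuda.cuMemCreate(size, prop, 0)
    err, ptr = cuda.cuMemAddressReserve(size, 0, 0, 0)
    (err,) = cuda.cuMemMap(ptr, size, 0, handle, 0)

    access = cuda.CUmemAccessDesc()
    access.location = prop.location
    access.flags = cuda.CUmemAccess_flags.CU_MEM_ACCESS_FLAGS_PROT_READWRITE
    (err,) = cuda.cuMemSetAccess(ptr, size, [access], 1)

    # Export the physical allocation as a file descriptor that another
    # process can import with cuMemImportFromShareableHandle.
    err, fd = cuda.cuMemExportToShareableHandle(
        handle,
        cuda.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR,
        0,
    )
    return ptr, size, fd
```

The granularity query matters in practice: asking cuMemCreate for an unaligned size fails, so even a one-byte slot ends up occupying a full granule.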
I also wrote a queue and a buffer that can be used to share CUDA tensors across different processes.
Explanation:
First we allocate physical memory using cuMemCreate, then reserve a virtual address range with cuMemAddressReserve, map the physical memory into that range with cuMemMap, and set access permissions with cuMemSetAccess. Now we can use the most important function, cuMemExportToShareableHandle, which exports the memory allocation as a file descriptor that can be shared between processes. However, to share the file descriptor correctly we need to send it through a Unix domain socket. A child process holding the other end of that socket can then receive the file descriptor and import it back into an allocation handle via cuMemImportFromShareableHandle. Lastly, the child uses the allocation handle to obtain a pointer by reserving a virtual address range, mapping the allocation into it, and setting access permissions, mirroring the steps in the parent.
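The fd-passing and child-side steps described above can be sketched as follows. The `send_fd`/`recv_fd` helpers use `socket.send_fds`/`socket.recv_fds` (Python 3.9+, SCM_RIGHTS ancillary data underneath); `import_shared_allocation` is a hypothetical name for the import-and-map sequence and, like the parent side, assumes an existing CUDA context and skips error checking. The demo at the bottom exercises only the fd passing, since that part needs no GPU.

```python
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    # Pass one file descriptor over a connected AF_UNIX socket
    # (SCM_RIGHTS under the hood; Python 3.9+ on Unix).
    socket.send_fds(sock, [b"fd"], [fd])


def recv_fd(sock: socket.socket) -> int:
    # Receive a single file descriptor from the peer.
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]


def import_shared_allocation(fd: int, size: int, device: int = 0):
    # Hypothetical child-side sequence: import the fd back into an
    # allocation handle, then reserve / map / set access to get a pointer.
    from cuda import cuda  # deferred; requires cuda-python and a GPU

    err, handle = cuda.cuMemImportFromShareableHandle(
        fd,
        cuda.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR,
    )
    err, ptr = cuda.cuMemAddressReserve(size, 0, 0, 0)
    (err,) = cuda.cuMemMap(ptr, size, 0, handle, 0)

    access = cuda.CUmemAccessDesc()
    access.location.type = cuda.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
    access.location.id = device
    access.flags = cuda.CUmemAccess_flags.CU_MEM_ACCESS_FLAGS_PROT_READWRITE
    (err,) = cuda.cuMemSetAccess(ptr, size, [access], 1)
    return ptr


# Demo of the fd-passing part alone (no GPU needed): hand the read end
# of a pipe from one socket endpoint to the other, then read through
# the received descriptor.
left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r, w = os.pipe()
os.write(w, b"hello")
os.close(w)
send_fd(left, r)
received = recv_fd(right)
data = os.read(received, 16)
print(data)  # b'hello'
```

Note that the received descriptor is a duplicate in the child's fd table; the same mechanism carries the descriptor produced by cuMemExportToShareableHandle.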
I hope this helps whoever needs to share CUDA tensors between different processes on Jetson. Furthermore, I am happy to hear any feedback on my solution, since I only started working with CUDA recently.
Best,
Jakub