Jetson AGX Orin CUDA IPC Support

ross_tsenov · July 26, 2022, 5:36pm

Hello,

Our team recently got its hands on an AGX Orin, which is very powerful. To take full advantage of what Orin can offer, we decided to virtualize our workload into multiple Docker containers that communicate with each other over Unix sockets. To keep performance up, we found that we can share a reference to the CUDA-allocated memory instead of transferring the entire array, which takes far too long.
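To illustrate the "share a reference, not the array" idea, here is a minimal sketch of passing a small tensor descriptor over a Unix socket. Only the descriptor travels over the socket; the data itself would stay wherever the handle points. The field names (`handle`, `shape`, `dtype`, `nbytes`) and the length-prefixed JSON framing are assumptions made for this example, not anything PyTorch or CUDA defines. `socket.socketpair()` stands in for the socket that would connect two containers.

```python
import json
import socket
import struct


def send_descriptor(sock: socket.socket, descriptor: dict) -> None:
    """Send a JSON descriptor prefixed with its length (4 bytes, big-endian)."""
    body = json.dumps(descriptor).encode("utf-8")
    sock.sendall(struct.pack("!I", len(body)) + body)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return fewer."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_descriptor(sock: socket.socket) -> dict:
    """Receive one length-prefixed JSON descriptor."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length).decode("utf-8"))


def round_trip_descriptor(descriptor: dict) -> dict:
    """Send a descriptor through a connected socket pair and read it back."""
    sender, receiver = socket.socketpair()
    try:
        send_descriptor(sender, descriptor)
        return recv_descriptor(receiver)
    finally:
        sender.close()
        receiver.close()


# Hypothetical descriptor for a float16 batch of shape (8, 3, 224, 224).
example = {
    "handle": "tensor-0",          # placeholder name, not a real CUDA handle
    "shape": [8, 3, 224, 224],
    "dtype": "float16",
    "nbytes": 8 * 3 * 224 * 224 * 2,
}
print(round_trip_descriptor(example))
```

The message is a few dozen bytes regardless of tensor size, which is why this pattern is attractive when the receiver can open the underlying memory directly.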

Most of our code base and logic is written in PyTorch, which has built-in support for CUDA IPC. However, Linux Tegra devices currently do not support CUDA IPC. We also tried to write our own C++/CUDA extension using:

EGLStream
NvSci
cuMemExportToShareableHandle() / cuMemImportFromShareableHandle()

Unfortunately, none of the above works nicely with Torch Tensors.

The recent CUDA for Tegra app note states:

“Since CUDA 11.5, only events-sharing IPC APIs are supported on L4T and embedded Linux Tegra devices with compute capability 7.x and higher. The memory-sharing IPC APIs are still not supported on Tegra platforms”

Does this mean that Linux Tegra devices will have full CUDA IPC support in the future? If so, are there any time estimates?

Or, due to differences in architecture, are Linux Tegra devices unable to support CUDA IPC?

AastaLLL · July 27, 2022, 3:08am

Hi,

Currently, we recommend using EGLStream, NvSci, or the cuMemExportToShareableHandle() / cuMemImportFromShareableHandle() APIs instead.
Did you meet any issues when using these alternatives?

Thanks.

ross_tsenov · July 27, 2022, 1:47pm

Hi,

As I mentioned above, most of our code base and logic is written in Python with PyTorch, and we would like to transfer torch Tensors between containers with minimal delay (a couple of milliseconds at most).

We can get a pointer to the CUDA-allocated memory from the torch library. However, I could not find out how torch allocates its memory. After some tests I assume it is just a regular cudaMalloc, but I may be wrong.

According to the documentation, NvSci and cuMemExportToShareableHandle() / cuMemImportFromShareableHandle() require their own memory allocation methods rather than cudaMalloc. That would mean allocating new memory and doing a device-to-device copy (from the cudaMalloc buffer on GPU 0 to a buffer on GPU 0 allocated with the other method), which would increase both time and memory usage. Or is there an easier way?

Additionally, we cannot use EGLStream because:
“EGL understands only two-dimensional image data. It cannot handle tensors or other non-image sensor data.”

If you have any ideas, I would appreciate it!!!
Thanks

AastaLLL · July 28, 2022, 8:05am

Thanks for the confirmation.

CUDA IPC is not available for Jetson.
Not sure if NvSci will be a good alternative for you.

In your use case, the memory buffer is only accessed by GPU, is this correct?
The memory buffer won’t be accessed by the CPU?

Thanks.

ross_tsenov · July 28, 2022, 2:29pm

The memory buffer can be accessed by the CPU, but transferring data of our size from GPU to CPU takes about 10 ms, which is already slow.

We ran a test in which one process moves the tensor to the CPU and then into shared memory, and another process retrieves the tensor from shared memory and transfers it to the GPU. This takes about 40 ms. Additionally, our communication has to be bidirectional, which brings the total to about 80 ms.
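For reference, the CPU shared-memory leg of that test can be sketched with the Python standard library alone. This is not our actual test code: the GPU-to-CPU and CPU-to-GPU copies are left out (they would sit before the write and after the read), the helper names are made up for this example, and both sides run in one process here, whereas in the real setup the reader would be a separate process that attaches by name.

```python
import time
from multiprocessing import shared_memory


def write_to_shared(payload: bytes) -> shared_memory.SharedMemory:
    """Producer side: create a shared-memory block and copy the payload in."""
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm


def read_from_shared(name: str, nbytes: int) -> bytes:
    """Consumer side: attach to the block by name and copy the payload out."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # Slice to nbytes because the OS may round the block size up.
        return bytes(shm.buf[:nbytes])
    finally:
        shm.close()


def round_trip(payload: bytes):
    """Write then read a non-empty payload; return it with elapsed milliseconds."""
    start = time.perf_counter()
    shm = write_to_shared(payload)
    try:
        received = read_from_shared(shm.name, len(payload))
    finally:
        shm.close()
        shm.unlink()  # free the block once both sides are done
    elapsed_ms = (time.perf_counter() - start) * 1e3
    return received, elapsed_ms


# Stand-in for a tensor that has already been copied to host memory.
data = bytes(range(256)) * 4096  # 1 MiB
received, elapsed_ms = round_trip(data)
print(f"copied {len(received)} bytes in {elapsed_ms:.3f} ms")
```

Timing this on its own shows how much of the total is spent in the host-side copies, as opposed to the GPU transfers on either end.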

If you have some different ideas I am eager to hear them!!

Container one creates tensors by processing raw data with AI and sends them to container two. The second container processes the tensors further and returns newly generated tensors to container one (there are 3-4 containers, but I hope you get the idea). Tensors can have different shapes and sometimes come in batches, and we can only spare about 2-3 ms for the transfer. IPC was an ideal solution in this case, and it worked great on PC.

After some discussions with the team, we will scrap the virtualization idea for now and go for a multipackage approach.

Anyway thanks a lot!!

jaiyamsharma · July 29, 2022, 1:11am

This is a very fascinating discussion. Can you share in general terms what kind of workload or application requires the use of two containers? Unless you are training, why not just use torchscript or onnxrt or TensorRT to bypass torch’s limitations?

ross_tsenov · July 29, 2022, 3:08pm

Our application technically does not require the use of multiple containers. Currently, it works great in a single environment. However, we are facing issues when we want to scale our solution and make it robust and generic. That is where the idea of splitting the logic into separate isolated containers came from, similar to how web microservices work. It would also allow us to split the code into multiple repos, which would reduce code coupling and let teams work on different parts of the code in isolation without breaking the main logic.

system · August 24, 2022, 1:16am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
