Our team recently got its hands on an AGX Orin, which is very powerful. To take full advantage of what Orin can offer, we decided to split our workload into multiple Docker containers that communicate with each other via Unix sockets. To preserve performance, we found that we can share a reference to the CUDA-allocated memory instead of transferring the entire array, which takes far too long.
Most of our code base and logic is written in PyTorch, which has built-in support for CUDA IPC. However, Linux Tegra devices currently do not support CUDA IPC. We also tried to write our own C++/CUDA extension to use:
“Since CUDA 11.5, only events-sharing IPC APIs are supported on L4T and embedded Linux Tegra devices with compute capability 7.x and higher. The memory-sharing IPC APIs are still not supported on Tegra platforms”
Does this mean that Linux Tegra devices will have full CUDA IPC support in the future? If so, are there any time estimates?
Or are Linux Tegra devices unsuitable for CUDA IPC because of architectural differences?
Currently, we recommend using EGLStream, NvSci, or the cuMemExportToShareableHandle() / cuMemImportFromShareableHandle() APIs instead.
Did you run into any issues when using these alternatives?
As I mentioned above, most of our code base and logic is written in Python with PyTorch. We would like to transfer torch tensors between containers with minimal delay (a couple of milliseconds at most).
We can get a pointer to the CUDA-allocated memory from the torch library. However, I could not find out how torch allocates its memory. After some tests, I assume it is just a regular cudaMalloc, but I am probably wrong.
According to the documentation, NvSci and cuMemExportToShareableHandle() / cuMemImportFromShareableHandle() require their own memory allocation methods rather than cudaMalloc. That would mean allocating new memory and copying device to device (from the cudaMalloc buffer on GPU 0 to a buffer allocated by the other method on the same GPU), which would increase both time and memory usage. Unless there is an easier way?
Additionally, we cannot use EGLStream because:
“EGL understands only two-dimensional image data. It cannot handle tensors or other non-image sensor data.”
If you have any ideas, I would appreciate it!!!
Thanks
The memory buffer can be accessed by the CPU. But transferring data of our size from GPU to CPU takes about 10 ms, which is already slow.
We ran a test in which one process moves a tensor to the CPU and then into shared memory, and another process retrieves the tensor from shared memory and transfers it to the GPU. This takes about 40 ms. Additionally, our communication has to be bidirectional, which brings the round trip to about 80 ms.
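For anyone who wants to reproduce the CPU side of that test, here is a minimal sketch using only the Python standard library. It is not our actual code: the helper names `publish` and `fetch` are hypothetical, and the raw `bytes` payload stands in for tensor data that has already been copied from the GPU to host memory. It also assumes both processes can see the same shared-memory namespace, which for separate containers means sharing `/dev/shm` or the IPC namespace.

```python
import struct
from array import array
from multiprocessing import shared_memory

# Hypothetical layout: an 8-byte little-endian length, then the raw payload.
# The length is stored because the OS may round the segment size up.
HEADER = struct.Struct("<Q")


def publish(payload: bytes) -> shared_memory.SharedMemory:
    """Producer side: copy host bytes into a new shared-memory segment."""
    shm = shared_memory.SharedMemory(create=True, size=HEADER.size + len(payload))
    HEADER.pack_into(shm.buf, 0, len(payload))
    shm.buf[HEADER.size:HEADER.size + len(payload)] = payload
    return shm  # the caller keeps this handle and unlinks it when done


def fetch(name: str) -> bytes:
    """Consumer side: attach to the segment by name and copy the payload out."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        (length,) = HEADER.unpack_from(shm.buf, 0)
        return bytes(shm.buf[HEADER.size:HEADER.size + length])
    finally:
        shm.close()


if __name__ == "__main__":
    data = array("f", [0.5, 1.5, 2.5]).tobytes()  # stand-in for tensor bytes
    segment = publish(data)
    try:
        print(fetch(segment.name) == data)
    finally:
        segment.close()
        segment.unlink()
```

Only the segment name has to travel over the Unix socket. The cost that hurts us is not this step but the GPU-to-CPU and CPU-to-GPU copies on either side of it.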
If you have some different ideas I am eager to hear them!!
Container one creates tensors after processing raw data with AI and sends them to container two. The second container processes the tensors further and returns newly generated tensors to container one (there are 3-4 containers, but I hope you get the idea). Tensors can take different shapes and are sometimes batched, and we can only spare about 2-3 ms for the transfer. CUDA IPC was an ideal solution in this case, and it worked great on PC.
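To illustrate what "different shapes and sometimes batches" means for the socket side, here is a rough sketch of how a message could be framed so the receiver knows the dtype and shape before it reads the data. The wire format and the names `send_tensor` and `recv_tensor` are made up for this example, not something from PyTorch or from our code base, and the payload here is plain bytes rather than a GPU buffer.

```python
import socket
import struct

# Hypothetical wire format (little-endian, no padding):
#   8-byte dtype name, 1-byte number of dimensions, 8-byte payload size,
#   then one 8-byte integer per dimension, then the payload itself.
PREFIX = struct.Struct("<8sBQ")


def send_tensor(sock: socket.socket, dtype: str, shape, payload: bytes) -> None:
    """Send one framed message describing and carrying a tensor's bytes."""
    header = PREFIX.pack(dtype.encode("ascii"), len(shape), len(payload))
    dims = struct.pack(f"<{len(shape)}Q", *shape)
    sock.sendall(header + dims + payload)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since recv() may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket mid-message")
        buf += chunk
    return bytes(buf)


def recv_tensor(sock: socket.socket):
    """Receive one framed message and return (dtype, shape, payload)."""
    dtype_raw, ndim, nbytes = PREFIX.unpack(_recv_exact(sock, PREFIX.size))
    shape = struct.unpack(f"<{ndim}Q", _recv_exact(sock, 8 * ndim))
    payload = _recv_exact(sock, nbytes)
    return dtype_raw.rstrip(b"\0").decode("ascii"), shape, payload


if __name__ == "__main__":
    left, right = socket.socketpair()  # stands in for the Unix socket between containers
    send_tensor(left, "float32", (2, 3), bytes(2 * 3 * 4))
    print(recv_tensor(right)[:2])
    left.close()
    right.close()
```

With CUDA IPC the payload would only be a small memory handle, which is why it fit our budget on PC. Sending the full tensor bytes this way still needs the GPU-to-CPU copy first.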
After some discussions with the team, we will scrap the virtualization idea for now and go for a multipackage approach.
This is a very fascinating discussion. Can you share in general terms what kind of workload or application requires the use of two containers? Unless you are training, why not just use TorchScript, ONNX Runtime, or TensorRT to bypass torch's limitations?
Our application technically does not require the use of multiple containers. Currently, it works great in a single environment. However, we run into issues when we try to scale our solution and make it robust and generic. That is where the idea of splitting the logic into separate isolated containers comes from, similar to how web microservices work. It would also allow us to split the code into multiple repos, which would reduce code coupling and let teams work on different parts of the code in isolation without breaking the main logic.