GPU memory leaks using shareable handles

Orin AGX 64 GB Developer Kit. Jetpack 6.0 w/CUDA 12.2.

I have two processes.

Process A creates blocks of GPU memory (cuMemCreate, cuMemExportToShareableHandle, etc) using the pattern shown in memMapIPCDrv in cuda-samples. It eventually calls cuMemRelease.
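For reference, a minimal sketch of the exporter-side pattern I'm describing (modeled on memMapIPCDrv; error handling omitted, device ordinal 0 assumed):

```cuda
#include <cuda.h>

// Allocate a physical block and export it as a POSIX fd that can be
// passed to another process over a Unix-domain socket (SCM_RIGHTS).
CUmemGenericAllocationHandle allocExportable(size_t size, int *fdOut) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = 0;  // device ordinal (assumption: single GPU)
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

    // Round the request up to the allocation granularity.
    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = ((size + gran - 1) / gran) * gran;

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, size, &prop, 0);
    cuMemExportToShareableHandle((void *)fdOut, handle,
                                 CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0);
    return handle;
}

// The writer later calls cuMemRelease(handle); the allocation stays
// alive while the exported fd or any importer's mapping references it.
```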

Process B imports the shareable handle (cuMemImportFromShareableHandle, cuMemMap, cuMemRelease, cuMemSetAccess, etc.), runs kernels, and eventually closes the handle and calls cuMemUnmap and cuMemAddressFree.
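And the corresponding importer-side sketch (again following memMapIPCDrv; error handling omitted, the fd is assumed to have arrived over a Unix-domain socket):

```cuda
#include <cuda.h>
#include <cstdint>

// Import the shared allocation, map it into a reserved VA range,
// and enable read/write access on device 0 (assumption).
CUdeviceptr importAndMap(int fd, size_t size) {
    CUmemGenericAllocationHandle handle;
    cuMemImportFromShareableHandle(&handle, (void *)(uintptr_t)fd,
                                   CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR);

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);
    cuMemMap(ptr, size, 0, handle, 0);

    // The mapping holds its own reference, so the imported handle
    // can be released immediately after mapping.
    cuMemRelease(handle);

    CUmemAccessDesc access = {};
    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access.location.id = 0;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);
    return ptr;
}

// Teardown after kernels finish. On x86 this returns the physical
// pages; on Jetson (this report) the memory stays attributed to
// the importer in jtop.
void teardown(CUdeviceptr ptr, size_t size) {
    cuMemUnmap(ptr, size);
    cuMemAddressFree(ptr, size);
}
```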

Both processes are persistent. On Jetson, process B's memory footprint grows until we run out of memory.

I’ve demonstrated this behavior by breaking the memMapIPCDrv sample into writer and reader processes and observing memory utilization with jtop. The writer process is run repeatedly while the reader stays persistent. Each time the writer runs, it creates a block of memory that the reader imports and runs a simple kernel against. When the writer exits, all of the memory it reserved (4 MB in the example) shows up in jtop as assigned to the reader, and it is not released after the reader calls cuMemUnmap and cuMemAddressFree.

Running the same code on x86, GPU memory utilization drops to zero as soon as the reader releases the memory.

I reported this as a bug and was directed to post here. It was closed as Not a Bug. Grateful for any tips.


Hi,

As we have a newer software release, could you test if the same issue occurs on the latest JetPack 6.2?

If so, could you share a reproducible sample with us?
(Process B, which has been separated into writer and reader processes, should be enough?)

We will need to test this internally before sharing more info with you.
Thanks.

I’ll put together a package with my rework of the memMapIPCDrv sample code. I’ll also try 6.2. Thank you.

I’ve attached the code I extracted from the memMapIPCDrv sample. It includes a README_DEMO. These are the results I’m seeing:
The reader is started first.


Then the writer is started, passing in the process ID of the reader (for the local socket). The next picture shows jtop while both the writer and reader are running. The reader has opened the shared memory, performed a trivial operation, and called cuMemUnmap() and cuMemAddressFree().

The final picture shows jtop after the writer has exited and the reader is waiting for a new handle. Each run of the writer grows the reader by 4 MB, the size of the block created by the writer with cuMemCreate().

memMapIPCDrvLeak.tar.gz (29.9 KB)

FYI: the code needs to be built in a subdirectory under cuda-samples/Samples// to pick up the Common headers.

EDIT: Attached zip should build w/o cuda-samples installed.

memMapIPCDrvLeak_v2.tar.gz (78.2 KB)

Hi,

Thanks a lot for sharing the sample.
We will test this internally and share more info with you.

Thanks.

Hi,

Thanks for your patience.

We also observed the same behavior in our environment and are now checking with our internal team for more input.
We will let you know once we have more information to share.

Thanks.