CUDA_CHECK(cudaMallocHost((void**)&argmax_buffer_cpu, BatchSize * OutputChannel * sizeof(float)));
The above code causes a memory leak to appear in our Docker container. We call it only once throughout the program, and the leak appears only when this code is enabled. The screenshot compares a run with the code enabled against a run with it removed.

NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4
If you do a cudaMallocHost without a corresponding cudaFreeHost in repetitive code (a loop, for example), you will certainly create a memory leak. For example, if you replaced a new operation (which might automatically be freed based on C++ scoping rules) with a cudaMallocHost operation as I described here (without a free operation, in a loop), you could certainly create a memory leak.
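To make that concrete, here is a minimal sketch of the two patterns (leaky versus balanced); the function names, num_batches, and batch_bytes are placeholders, not code from the thread:

#include <cuda_runtime.h>

// Leaking pattern: a pinned allocation inside a loop with no matching free.
void leaky(int num_batches, size_t batch_bytes) {
    for (int i = 0; i < num_batches; ++i) {
        float* buf = nullptr;
        cudaMallocHost((void**)&buf, batch_bytes); // pins a fresh host buffer
        // ... fill buf / launch work ...
        // missing cudaFreeHost(buf): each iteration's pinned memory is lost
    }
}

// Balanced pattern: every cudaMallocHost is paired with a cudaFreeHost.
void balanced(int num_batches, size_t batch_bytes) {
    for (int i = 0; i < num_batches; ++i) {
        float* buf = nullptr;
        cudaMallocHost((void**)&buf, batch_bytes);
        // ... fill buf / launch work ...
        cudaFreeHost(buf); // releases the pinned allocation
    }
}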
Hi @Robert_Crovella, thanks for replying.
Yes, we are doing it only once in the module run.
Let me walk you through our process.
- We allocated an area in memory to hold the model outputs from enqueue. This pointer is static:
static float* argmax_buffer_cpu = nullptr;
- Then we ran the batched inference using enqueue.
- We keep overwriting that area with new images and their model outputs (a simplified sketch of this pattern follows this list).
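Roughly, the init-once / overwrite-per-batch pattern looks like the sketch below. RunBatch and the reduced parameter lists are illustrative names only; the real InitializeGPUMemory takes the full argument list shown further below.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Error-check macro, as in the snippet at the top of this post.
#define CUDA_CHECK(call)                                         \
    do {                                                         \
        cudaError_t err__ = (call);                              \
        if (err__ != cudaSuccess) {                              \
            fprintf(stderr, "CUDA error: %s\n",                  \
                    cudaGetErrorString(err__));                  \
            abort();                                             \
        }                                                        \
    } while (0)

// Allocated once, then reused for every batch.
static float* argmax_buffer_cpu = nullptr;

// Called a single time at startup (simplified; the real function
// takes many more parameters).
void InitializeGPUMemory(int batch_size, int output_channel) {
    CUDA_CHECK(cudaMallocHost((void**)&argmax_buffer_cpu,
                              batch_size * output_channel * sizeof(float)));
}

// Called per batch: the same pinned buffer is overwritten every time.
void RunBatch(const void* output_gpu, size_t out_bytes, cudaStream_t stream) {
    // ... context->enqueue(...) runs inference into output_gpu ...
    CUDA_CHECK(cudaMemcpyAsync(argmax_buffer_cpu, output_gpu, out_bytes,
                               cudaMemcpyDeviceToHost, stream));
    CUDA_CHECK(cudaStreamSynchronize(stream));
}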
The whole process lives in a C++ codebase, which is built into a shared object file. Image intake happens in the Python codebase, where we decode the message and pass the bytes to the C++ functions using ctypes, as shown below:
import ctypes
from ctypes import cdll
lib = cdll.LoadLibrary("/app/cpp_trt_processing.so")
# init func
lib.InitializeGPUMemory.argtypes = [
ctypes.c_int, # batch size
ctypes.c_int, # InputW
ctypes.c_int, # InputH
ctypes.c_int, # InputChannel
ctypes.c_int, # OutputChannel
ctypes.c_int, # Number of color models
ctypes.POINTER(ctypes.c_int), # Feature sizes list for models
ctypes.POINTER(ctypes.c_float), # Mean values list for kernels
ctypes.POINTER(ctypes.c_float), # Scale values list for kernels
ctypes.c_int, # DebugLevel
ctypes.c_bool # save_flag
]
lib.InitializeGPUMemory.restype = None
The above InitializeGPUMemory runs only once in the program logic.
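For reference, the extern "C" signature on the C++ side that those argtypes bind to would look roughly like this; a sketch, with parameter names inferred from the comments above:

// Sketch of the C ABI that the ctypes declaration binds to. extern "C"
// suppresses C++ name mangling so ctypes can locate the symbol in the .so.
extern "C" void InitializeGPUMemory(
    int batch_size,
    int input_w,
    int input_h,
    int input_channel,
    int output_channel,
    int num_color_models,
    int* feature_sizes,   // feature sizes list for models
    float* mean_values,   // mean values list for kernels
    float* scale_values,  // scale values list for kernels
    int debug_level,
    bool save_flag);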
Now, a few tests which we did:
- Running an empty function in the C++ codebase: no leak seen.
- Running an empty function that takes arguments in C++: no leak seen.
- Just running cudaMallocHost on our static float*: the leak is observed in the graph.
EDIT:
- We also tried cudaMallocHost paired with cudaFreeHost, although that would defeat our logic of processing images. Unfortunately, the leak is still seen.
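Note that a matching release does not have to happen per image: a single cudaFreeHost at shutdown balances the single allocation. A sketch, where ReleaseGPUMemory is a hypothetical teardown hook, not a function from our codebase:

// Hypothetical one-time teardown that balances the one-time allocation,
// without freeing and reallocating per image.
extern "C" void ReleaseGPUMemory() {
    if (argmax_buffer_cpu != nullptr) {
        cudaFreeHost(argmax_buffer_cpu); // unpins and frees the host buffer
        argmax_buffer_cpu = nullptr;
    }
}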
My guess would be that your module is being called once per inference, or once per image, and so you are getting repeated calls to cudaMallocHost. Why not put a printf statement right after the call to cudaMallocHost to see if it is being called more than once? If it is, then that is your coding defect. If it isn't, I'm at a loss to explain how the simple presence of a single cudaMallocHost call could lead to an ongoing memory leak. In that case it would probably be best to create the shortest possible example that shows the leak. Once you have done that, advance to the latest CUDA version to see if it still exists. If it still exists, post your example here or file a bug.
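A shortest-possible repro along those lines might look like this sketch: one logged cudaMallocHost, then an idle loop so the container's memory graph can be watched.

#include <cuda_runtime.h>
#include <cstdio>
#include <unistd.h>

// Minimal repro sketch: a single pinned allocation, logged once, then idle.
// If memory keeps growing while this sleeps, the growth cannot be coming
// from repeated cudaMallocHost calls.
int main() {
    float* buf = nullptr;
    cudaError_t err = cudaMallocHost((void**)&buf, 1 << 20); // 1 MiB
    printf("cudaMallocHost called once: %s\n", cudaGetErrorString(err));
    fflush(stdout);
    for (;;) {
        sleep(10); // watch the container's memory usage during this idle loop
    }
}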