How to Free GPU Memory?

I am trying object detection (jetson-inference DetectNet) on OpenCV camera input.

To map a cv::Mat into GPU memory, I created a function modeled closely on loadImageRGBA() in loadImage.cpp.

bool mapMatToGPU(cv::Mat img, float4** cpu, float4** gpu) {
  // wrap the existing cv::Mat data -- no copy; assumes img is already
  // in RGB order (convert with cv::cvtColor() first if it is BGR)
  QImage qImg = QImage((uchar*) img.data, img.cols, img.rows, img.step,
                       QImage::Format_RGB888);

  if (!cudaAllocMapped((void**) cpu, (void**) gpu,
                       qImg.width() * qImg.height() * sizeof(float) * 4)) {
    return false;
  }

  float4* cpuPtr = *cpu;

  for (uint32_t y = 0; y < qImg.height(); y++) {
    for (uint32_t x = 0; x < qImg.width(); x++) {
      const QRgb rgb  = qImg.pixel(x, y);
      const float4 px = make_float4(float(qRed(rgb)), float(qGreen(rgb)),
                                    float(qBlue(rgb)), float(qAlpha(rgb)));
      cpuPtr[y * qImg.width() + x] = px;
    }
  }
  return true;
}

The only crucial change is that I construct the QImage directly around the existing cv::Mat data with QImage(...) instead of calling qImg.load(...).
(Besides that, I removed some printfs and assignments.)

So I can detect objects in the camera input from OpenCV; however, the used memory keeps increasing and the process eventually gets killed…

I found that DetectNet can be killed by memory issues.

I think I need to free the allocated memory somehow.

How can I do this?

Ah, the solution was there!


You call cudaAllocMapped(), but you never release the memory.

Either re-use the previous buffer, or release the memory once you’re done with it.
cudaAllocMapped() is a wrapper for cudaHostAlloc() and cudaHostGetDevicePointer().
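Roughly, the wrapper does something like the following (a sketch of the idea, not the actual jetson-inference source):

```cpp
#include <cuda_runtime.h>

// Allocate zero-copy (mapped) host memory and retrieve the device-side
// alias for it -- the two pointers refer to the same physical allocation.
bool allocMappedSketch(void** cpuPtr, void** gpuPtr, size_t size)
{
  if (cudaHostAlloc(cpuPtr, size, cudaHostAllocMapped) != cudaSuccess)
    return false;

  if (cudaHostGetDevicePointer(gpuPtr, *cpuPtr, 0) != cudaSuccess)
    return false;

  return true;
}
```

Because both pointers alias one allocation, a single cudaFreeHost() on the CPU pointer releases everything.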

This is a slow operation, and you're not supposed to do it each frame. Instead, remember the buffer you receive from the first call to cudaAllocMapped(), and re-use it for each frame.

You may eventually want to allocate two or three buffers and double- or triple-buffer, depending on the structure of your code and workload.
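A minimal double-buffering sketch, assuming hypothetical fillBuffer() and launchDetect() placeholders for your own capture and inference code:

```cpp
// Two pre-allocated mapped buffers; alternate between them each frame so
// the GPU can still be reading frame N while the CPU fills frame N+1.
float4* cpuBuf[2];
float4* gpuBuf[2];

for (int i = 0; i < 2; i++)
  cudaAllocMapped((void**)&cpuBuf[i], (void**)&gpuBuf[i], bufSize);

int idx = 0;
while (capturing) {
  fillBuffer(cpuBuf[idx]);    // placeholder: CPU writes frame into buffer idx
  launchDetect(gpuBuf[idx]);  // placeholder: GPU reads buffer idx
  idx ^= 1;                   // next frame uses the other buffer
}
```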

If you absolutely need to free the data, then call cudaFreeHost(). However, you can't do that until you're done using the memory allocated and returned by this function.
I highly recommend against using alloc/free each frame, though.
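Putting it together, the recommended structure looks roughly like this (captureFrame(), copyMatToBuffer(), and runDetectNet() are placeholders for your own code, not real APIs):

```cpp
// Allocate the mapped buffer ONCE, before the processing loop.
float4* imgCPU = NULL;
float4* imgGPU = NULL;
const size_t bufSize = width * height * sizeof(float4);

if (!cudaAllocMapped((void**)&imgCPU, (void**)&imgGPU, bufSize))
  return false;

while (!signalReceived) {
  cv::Mat frame = captureFrame();      // placeholder: OpenCV capture
  copyMatToBuffer(frame, imgCPU);      // refill the EXISTING buffer, no new alloc
  runDetectNet(imgGPU, width, height); // placeholder: net->Detect(...)
}

// Free once, at shutdown -- the GPU pointer aliases the same allocation,
// so freeing the CPU pointer releases both.
CUDA(cudaFreeHost(imgCPU));
```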

You mean call CUDA(cudaFreeHost(imgCPU)); after some amount of time?

I suppose @snarky meant to allocate the CUDA buffer in the initialization of your program, before the image-processing loop, and free it once everything is done. That way, you always have the buffer available in the function called from the loop, with no need to allocate/free for each image, which will speed up your code.

Two notes however:

  • if you have different sizes of images, use the biggest size when allocating the buffer. A smaller image causes no problem in a bigger buffer.
  • if you are multi-threading, you may have to set a lock on the buffer or have each thread allocate its own buffer.
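A short sketch of the first note, assuming maxWidth/maxHeight are the largest frame dimensions you expect (these names are illustrative):

```cpp
// Allocate once for the largest expected frame; smaller frames simply
// occupy the front of the buffer.
const size_t maxBufSize = maxWidth * maxHeight * sizeof(float4);
cudaAllocMapped((void**)&imgCPU, (void**)&imgGPU, maxBufSize);

// When filling pixels for a smaller frame, index with THAT frame's own
// width, e.g. cpuPtr[y * frameWidth + x], so rows stay contiguous.
```

For the multi-threaded case, the simplest safe option is one buffer per thread, allocated in each thread's own initialization, since that avoids locking entirely.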