CUDA thread-local array memory is not released after kernel execution finishes

I have encountered a problem where CUDA thread-local array memory is not released after kernel execution.

I am using CUDA 9.1 (compute_61,sm_61) on Windows 10.

Below is the code snippet to query device memory:

#include <cstdio>
#include <iostream>
#include <cuda_runtime.h>

// Minimal error-checking macro (assumed; the original post uses CHECK
// without showing its definition).
#define CHECK(call)                                                  \
  do {                                                               \
    cudaError_t err = (call);                                        \
    if (err != cudaSuccess)                                          \
      printf("CUDA error: %s\n", cudaGetErrorString(err));           \
  } while (0)

void query_device_memory(const char * str)
{
  size_t free_byte;
  size_t total_byte;
  auto cuda_status = cudaMemGetInfo(&free_byte, &total_byte);

  if (cudaSuccess != cuda_status)
  {
    printf("Error: cudaMemGetInfo fails, %s \n", cudaGetErrorString(cuda_status));
  }

  double free_db = (double)free_byte;
  double total_db = (double)total_byte;
  double used_db = total_db - free_db;

  // Print the caller-supplied label followed by usage in MiB.
  std::cout << str << " GPU memory usage: used = " << used_db / 1024.0 / 1024.0 << std::endl;
}

Below is the dummy kernel used to illustrate the problem:

__global__ void dummy_kernel(float * dumm)
{
  auto const index = threadIdx.x;
  int const val = *dumm;  // 0 here, since d_dumm is memset to zero
  // 16384 floats = 64 KiB per thread; far too large for registers,
  // so this array is backed by (device-memory) local memory.
  float buff_1[16384] = {};
  for (size_t i = 0; i < 16384; i++)
  {
    buff_1[(index * val) % 16384] += *dumm;
  }
  *dumm = buff_1[val % 16384];
}

int main()
{
  query_device_memory("before:");

  float * d_dumm;
  CHECK(cudaMalloc(&d_dumm, sizeof(float)));
  CHECK(cudaMemset(d_dumm, 0, sizeof(float)));
  query_device_memory("after malloc:");

  dummy_kernel<<<1000, 32>>>(d_dumm);
  CHECK(cudaGetLastError());
  CHECK(cudaDeviceSynchronize());
  query_device_memory("after:");

  float answer = 0.f;
  CHECK(cudaMemcpy(&answer, d_dumm, sizeof(float), cudaMemcpyDeviceToHost));
  CHECK(cudaDeviceReset());
  return 0;
}

The output of the program:

before: GPU memory usage: used = 2204.38
after malloc: GPU memory usage: used = 2206.38
after: GPU memory usage: used = 4878.38
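
For scale: buff_1 is 16384 floats, i.e. 64 KiB of local memory per thread, and the driver appears to size its reservation for every thread that could be resident on the device at once. A rough estimate of that reservation can be computed from the device properties (a sketch only; estimate_local_reservation is a name made up here, and the exact reservation policy is an assumption):

// Rough estimate only: assumes the driver reserves buff_1's 64 KiB of
// local memory once per maximum-resident thread on the device.
void estimate_local_reservation()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  size_t const per_thread = 16384 * sizeof(float);  // 64 KiB for buff_1
  size_t const max_resident = (size_t)prop.multiProcessorCount
                            * prop.maxThreadsPerMultiProcessor;
  printf("estimated local-memory reservation: %zu MiB\n",
         per_thread * max_resident / (1024 * 1024));
}

This order of magnitude is consistent with the ~2.6 GB jump between "after malloc:" and "after:" above.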

This issue is still being worked on. I don’t have any information about a timeframe for when the behavior may change.

In the meantime, a method to mitigate the effect is to issue:

cudaDeviceSetLimit(cudaLimitStackSize, 0);

after the kernel call where the memory is desired to be freed.

Using a value of 0 resets the property to its “default” value for the architecture. Alternatively, if a subsequent kernel call needs the stack size to be some other value, that value can be passed instead.

After running the above line of code, a subsequent call to cudaMemGetInfo should return results more in line with expectations.
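
For example, applied to the program above (a sketch reusing the CHECK macro and query_device_memory from the question):

dummy_kernel<<<1000, 32>>>(d_dumm);
CHECK(cudaGetLastError());
CHECK(cudaDeviceSynchronize());

// Reset the device stack size limit to its default; this releases the
// local memory reservation made for the launch above. Pass a nonzero
// byte count instead if a later kernel needs a specific stack size.
CHECK(cudaDeviceSetLimit(cudaLimitStackSize, 0));

query_device_memory("after limit reset:");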

Hi,
Is this problem only in the values reported by

cudaMemGetInfo()

and the memory is actually available when using cudaMalloc(), or is the memory not freed at all until cudaDeviceSetLimit() is called?

Thanks