Unaccounted memory consumption while running the kernel

I have a kernel that runs a specific set of calculations for many points. The calculations for each thread (point) need a large amount of temporary storage (roughly a 5000-element double array). That is far too large for shared memory. If I instead allocate this scratch space in global memory for every point up front, the total exceeds my GPU's capacity, given the number of points I have. So I declared the temporaries as local variables inside the kernel and device functions to avoid running out of memory, and I store only the results in global memory allocated before the kernel launch.
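For context, the pattern I'm describing looks roughly like the sketch below (the names, sizes, and the placeholder calculation are illustrative, not my actual code); the per-thread array is far too large for registers or shared memory, so the compiler places it in local memory:

```cuda
#define SCRATCH 5000  // per-thread temporaries, 5000 doubles = ~40 KB each

// Hypothetical kernel: each thread processes one point using a large
// local scratch array; only the final result goes to global memory.
__global__ void computePoints(const double *in, double *out, int nPoints)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPoints) return;

    double tmp[SCRATCH];          // too big for registers; the compiler
                                  // spills this array to local memory
    for (int k = 0; k < SCRATCH; ++k)
        tmp[k] = in[i] * k;       // placeholder calculation

    double acc = 0.0;
    for (int k = 0; k < SCRATCH; ++k)
        acc += tmp[k];
    out[i] = acc;                 // result stored in pre-allocated global memory
}
```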

I monitor how much memory the GPU uses both before and after the kernel runs. Before execution, the memory utilization matches my global memory allocations. But after launching the kernel it jumps suddenly (by around 7 GB), and even after the kernel finishes, my code does not release that memory. I have checked for memory leaks and could not find any. I have also tested the code with memcheck and it reported no problems.

Interestingly, the size of this excess memory does not depend on the number of points; it is always constant.

Does anyone have information on this? Is there a mechanism in CUDA that needs and uses this memory?

Thanks in advance.

Local memory is stored in the same physical backing (GPU DRAM) as the logical global space, so a large per-thread local allocation will use up a large amount of this space. The runtime sizes the local-memory pool for the maximum number of threads that can be resident on your GPU at once, not for your actual grid size. The amount is therefore determined by the size of the per-thread allocation and the characteristics of your GPU, which is why it appears to be "constant".

The memory is not released immediately when your kernel finishes. It will/should be released when your application finishes (i.e., when the CUDA context is destroyed).
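If you need the memory back before the process exits, destroying the context releases it. A minimal sketch (note that `cudaDeviceReset()` also invalidates every existing allocation, stream, and event, so it is rarely what you want mid-application; `heavyKernel` here is a hypothetical placeholder):

```cuda
#include <cuda_runtime.h>

__global__ void heavyKernel(double *out) { /* ... large local arrays ... */ }

int main(void)
{
    double *d_out;
    cudaMalloc(&d_out, 1024 * sizeof(double));
    heavyKernel<<<4, 256>>>(d_out);
    cudaDeviceSynchronize();   // the local-memory pool is still held here

    cudaDeviceReset();         // destroys the context: releases the pool,
                               // but also frees d_out and all other state
    return 0;
}
```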

