I have a kernel that runs a specific set of calculations for many points. The calculation on each thread (one point per thread) needs a large amount of temporary storage, roughly a double array of 5000 elements. That is far too large for shared memory. If I allocate that space in global memory instead, then given the number of points I have, it exceeds the capacity of my GPU. So, to avoid running out of memory, I declared these temporaries as local variables inside the kernel and device functions. Only the results are stored in global memory, which I allocate before the kernel launch.
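To illustrate, the kernel is structured roughly like this (the names and the calculation itself are simplified placeholders, not my actual code):

```cuda
// Sketch of the setup: one thread per point, a large per-thread
// scratch array, and only the final result written to global memory.
__global__ void computePoints(const double *points, double *results, int numPoints)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPoints) return;

    // Large per-thread temporary storage; too big for shared memory,
    // so the compiler places it in local memory (which physically
    // resides in device DRAM, per-thread).
    double scratch[5000];

    double acc = 0.0;
    for (int k = 0; k < 5000; ++k) {
        scratch[k] = points[i] * k;   // placeholder for the real calculations
        acc += scratch[k];
    }
    results[i] = acc;                 // result goes to pre-allocated global memory
}
```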
I monitor how much memory the GPU uses both before and after the kernel runs. Before execution, the memory utilization is in line with my global memory allocations. But after launching the kernel it jumps suddenly (by around 7 GB), and even after the kernel finishes, my code does not release that memory. I have checked for memory leaks and could not find any. I have also tested my code with memcheck and it reported no problems.
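For reference, a minimal sketch of how I do this check (here using `cudaMemGetInfo`; the kernel name and launch parameters are placeholders):

```cuda
// Query free/total device memory before and after the kernel run.
size_t freeBytes, totalBytes;

cudaMemGetInfo(&freeBytes, &totalBytes);
printf("before kernel: %zu MB free of %zu MB\n",
       freeBytes >> 20, totalBytes >> 20);

computePoints<<<numBlocks, blockSize>>>(dPoints, dResults, numPoints);
cudaDeviceSynchronize();              // make sure the kernel has finished

cudaMemGetInfo(&freeBytes, &totalBytes);
printf("after kernel:  %zu MB free of %zu MB\n",
       freeBytes >> 20, totalBytes >> 20);
```

The "after" reading stays about 7 GB lower than the "before" reading for the remainder of the program's lifetime.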
Interestingly, the size of this excess memory does not depend on the number of points; it is always the same constant amount.
Does anyone have information on this? Is there some mechanism in CUDA that needs and uses this memory?
Thanks in advance.