When I define a large (>1024-byte) fixed-size array within a kernel, the compiler appears to create a global memory allocation that persists past the completion of the kernel and, as far as I can tell, there is no way to free this allocation for the remainder of the program. I have attached a minimal example below. This is problematic for me: my program requires a kernel with such a large array, then must run a subsequent operation that utilizes almost all device memory. That is currently not possible because of the memory leaked by the first kernel, which I have no way of freeing.
I have attempted to use cudaDeviceReset(), which does appear to affect this allocation somehow, but it causes all kinds of other problems for me, as it tears down many things I do not wish it to. Ideally there would be something like cudaFreeSpilledGlobalMem() that would specifically free an allocation like this.
I have found no documentation on what exactly the compiler is doing here and why. For example, it does appear that this leaked global memory allocation is reused by subsequent kernels that similarly spill into global memory, but I can find no information confirming that this is the case.
So, if anyone could direct me on how to properly free this leaked memory, or at least towards documentation discussing what is going on here, I would be greatly appreciative.
#include <cstdint>
#include <iostream>

constexpr int N = 1 << 20;
constexpr int TPB = 128;
constexpr int BPG = (N + TPB - 1) / TPB;
// Reducing the spilled array size reduces the memory leak; the leak is
// eliminated at REG_ARR_LN = 256 (1 kB)
constexpr int REG_ARR_LN = 32 * 64;

__global__ void kernel(int loop_num) {
  uint32_t spilled[REG_ARR_LN];
  // Do some operation with spilled to ensure the compiler does not optimize
  // it away
  for (int i = 0; i < loop_num; ++i) {
    spilled[i] = spilled[(i - 1) % REG_ARR_LN];
  }
}

__global__ void kernel_gsl(int loop_num) {
  uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  uint32_t spilled[REG_ARR_LN];
  for (int idx = tid; idx < N; idx += gridDim.x * blockDim.x) {
    for (int i = 0; i < loop_num; ++i) {
      spilled[i] = spilled[(i - 1) % REG_ARR_LN];
    }
  }
}

int main() {
  size_t free_mem = 0, total_mem = 0;
  cudaMemGetInfo(&free_mem, &total_mem);

  // Calling kernel once leaks some memory that cannot be recovered without
  // resetting the device
  kernel<<<BPG, TPB>>>(REG_ARR_LN);

  // Repeated calls do not increase leak size
  // for (int i = 0; i < 100; ++i) {
  //   kernel<<<BPG, TPB>>>(REG_ARR_LN);
  // }

  // Theoretically, resetting the device might eliminate the memory leak, but
  // it causes all kinds of problems with proceeding to use the CUDA API; here
  // it causes an error trying to subsequently fetch memory info. Even if it
  // worked properly, it has much broader impacts that are not always desired;
  // for example, I may wish to free this leaked memory without deleting some
  // existing streams, which appears to be impossible
  // cudaDeviceSynchronize();
  // cudaDeviceReset();

  // Smaller block dim has the same result
  // kernel<<<BPG * 2, TPB / 2>>>(REG_ARR_LN);

  // Grid-stride loop version has the same result
  // kernel_gsl<<<128, TPB>>>(REG_ARR_LN);

  // Running both kernel and kernel_gsl maintains the same leak size, so it
  // appears the "leaked" global memory allocation is reused by any kernels
  // spilling into global memory
  // kernel<<<BPG, TPB>>>(REG_ARR_LN);
  // kernel_gsl<<<128, TPB>>>(REG_ARR_LN);

  size_t new_free_mem = 0;
  cudaError_t err = cudaMemGetInfo(&new_free_mem, &total_mem);
  if (err != cudaSuccess) {
    std::cerr << cudaGetErrorString(err) << std::endl;
  } else {
    std::cout << "Leaked GPU memory: "
              << (free_mem - new_free_mem) / (1024 * 1024) << " MB\n";
  }
  return 0;
}
Allocating large-ish arrays in functions has never been a great idea, IMHO, but stylistic preferences differ. Alternatively, you could dynamically allocate memory for the workspace needed by the kernel (with LAPACK-style query capability if need be), and pass a pointer to the allocated memory to the kernel.
For my application, the performance impact of the large array is negligible compared to other bottlenecks, and it makes for some more convenient code. That being said, I am more concerned about the discovery that there is any scenario in which a kernel creates a global memory allocation that persists past the life of the kernel, with no handle by which to manage it. I haven’t been able to find any documentation on this and would at the very least like to understand how it works.
To clarify, you’re suggesting manually creating a global memory allocation for the “overhang” data; i.e., if I need 2048 bytes for the data I’m currently storing in this array, creating an array in the kernel to store the first 1024 bytes and a manual global memory allocation equal to the amount needed to store the remaining spilled 1024 bytes per thread? That’s a good idea for a workaround, although it introduces some code complexity that seems like it shouldn’t be necessary. It would also require knowing the largest array size that will not spill to global memory; I measured that as 1024 bytes, but it is unclear to me how to compute that value or what hardware parameters determine it.
I am not sure where the notion of “spilled memory” comes from here. What I am suggesting is that all large data objects should be allocated outside of functions, and pointers or references to these data objects then passed to the functions, completely independent of the use of CUDA. For CUDA specifically, one could buffer such data in shared memory where that makes sense (i.e. there is re-use inside the kernel).
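A minimal sketch of what I mean (kernel_ws and the sizing below are my own illustration: one contiguous slice per thread, with a modest grid so the workspace stays bounded):

#include <cstdint>

constexpr int TPB = 128;
constexpr int NUM_BLOCKS = 128;   // modest grid keeps the workspace small
constexpr int ARR_LN = 32 * 64;   // per-thread workspace elements

// Each thread operates on its own slice of a caller-allocated workspace;
// the allocation is fully under the caller's control.
__global__ void kernel_ws(uint32_t* workspace, int loop_num) {
  uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  uint32_t* slice = workspace + static_cast<size_t>(tid) * ARR_LN;
  slice[0] = tid;
  for (int i = 1; i < loop_num; ++i) {
    slice[i] = slice[i - 1] + 1;
  }
}

int main() {
  // 128 blocks * 128 threads * 2048 elements * 4 bytes = 128 MB here
  const size_t bytes =
      static_cast<size_t>(NUM_BLOCKS) * TPB * ARR_LN * sizeof(uint32_t);
  uint32_t* workspace = nullptr;
  cudaMalloc(&workspace, bytes);
  kernel_ws<<<NUM_BLOCKS, TPB>>>(workspace, ARR_LN);
  cudaDeviceSynchronize();
  cudaFree(workspace);  // can be freed before the memory-hungry next step
  return 0;
}

Note that per-thread contiguous slices are the simplest layout but not the most coalescing-friendly; interleaving by thread index is often preferable if this path ever becomes performance-relevant.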
In CUDA, by default, any data objects defined locally to a kernel, like spilled here, are assigned to (thread-) local memory, which is a mapped portion of global memory. As an optimization, the compiler can pull some of these data objects into registers. This typically happens for scalars. In the case of arrays that will only happen if they are “small” (as determined by some heuristic) and indexing for all accesses is compile-time constant. The array spilled here is definitely not small.
Only when the compiler first decides to move a data object into a register and later discovers that it ran out of registers does it temporarily spill the data back to local memory. It usually does so intelligently, e.g. outside of the inner loops of loop nests. Not all uses of local memory are thus related to spilling. To avoid confusion one would want to keep the terminology straight.
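To make the distinction concrete, here is a sketch of the two cases (the promotion heuristics are compiler-internal, but compiling with nvcc -Xptxas -v reports the register count and any local-memory/spill bytes per kernel):

#include <cstdint>

// Small array, fully unrolled, with compile-time-constant indexing
// throughout: a good candidate for promotion into registers.
__global__ void likely_registers(uint32_t* out) {
  uint32_t tmp[4];
#pragma unroll
  for (int i = 0; i < 4; ++i) {
    tmp[i] = i * i;
  }
  out[threadIdx.x] = tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

// Large array with data-dependent indexing: assigned to (thread-)local
// memory, i.e. a mapped portion of global memory, from the start.
// Assumes 1 <= n <= 2048.
__global__ void stays_local(const uint32_t* in, uint32_t* out, int n) {
  uint32_t tmp[2048];
  tmp[0] = in[threadIdx.x];
  for (int i = 1; i < n; ++i) {
    tmp[i] = tmp[i - 1] + i;
  }
  out[threadIdx.x] = tmp[n - 1];
}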
Thanks, I appreciate the terminology clarification; yes, I was mingling the concepts of truly spilled registers and local arrays that may be promoted into registers. In my application, REG_ARR_LN is actually generated by a preprocessor and can be either small enough to potentially be promoted into registers or large enough that this clearly will not happen. It would be convenient for my purposes to use the same concise bit of code for both cases (since, again, the performance impact of the entire component is negligible), but it seems the unreleased local memory is a known issue without a planned solution, so I will have to do something more sophisticated.
Interesting to see that the issue was discussed five years ago. Was a bug filed at the time? If so, what is the status of that bug? My assumption is that always freeing the memory has negative performance implications, which is why the equivalent action of the suggested workaround is not applied by default.
Given that was 5 years ago, any information on whether this is something that is still being worked on?
Edit: Also, when I try inserting cudaDeviceSetLimit(cudaLimitStackSize, 0); after my kernel call, I get “an illegal memory access was encountered” on the subsequent cudaMemGetInfo call, so it doesn’t look like that workaround is a safe solution…
Nothing is still being worked on. I did file a bug (2126657) ~6 years ago related to this, however it tracked separate observations from what was reported by the original poster. The suggested response at the time was to use the provided API to manage this, and relevant to this specific report, no further action was taken, and the bug was closed.
You’re welcome to file a bug if you wish, if you have specific requests. If your request is for an “auto-resize” feature, I can state that was looked at previously and not acted on. I won’t be able to go into reasons or an extended discussion of what was investigated 5-6 years ago.
You should probably provide a complete example. It may be necessary to do the cudaDeviceSetLimit() call after a device synchronization step.
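Something like this is what I had in mind (a sketch; 1024 is just an example stack size, not a required value):

#include <cstdio>
#include <cstdint>

__global__ void big_local_kernel(uint32_t* out, int n) {
  uint32_t scratch[2048];  // large enough to force a local-memory allocation
  scratch[0] = n;
  for (int i = 1; i < n; ++i) scratch[i] = scratch[i - 1] + 1;
  if (out) out[threadIdx.x] = scratch[n - 1];  // prevents optimizing it away
}

int main() {
  big_local_kernel<<<8192, 128>>>(nullptr, 2048);
  // Finish (and error-check) the kernel before touching the limit.
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("kernel: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Shrinking the stack limit allows the driver to release the local-memory
  // backing allocation created for the launch above.
  err = cudaDeviceSetLimit(cudaLimitStackSize, 1024);
  if (err != cudaSuccess) {
    printf("setLimit: %s\n", cudaGetErrorString(err));
    return 1;
  }

  size_t free_mem = 0, total_mem = 0;
  cudaMemGetInfo(&free_mem, &total_mem);
  printf("free: %zu MB\n", free_mem / (1024 * 1024));
  return 0;
}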
With error checking added to each call, the output is:

cudaDeviceSetLimit: an illegal memory access was encountered
cudaMemGetInfo: an illegal memory access was encountered

so I am not seeing how to use this API to get the desired result. If there is a way to use it to free this memory for further use, then that is an adequate solution for me; otherwise, yes, I will file a bug.
(edited to add verification that the error isn’t happening during kernel execution)
You may wish to learn more about code verification, debugging, and error reporting in CUDA. Unit 12 of this online course covers relevant topics and may be of interest.
You have illegal code. This has nothing to do with your usage of the cudaDeviceSetLimit() api.
If you run your code under compute-sanitizer, you will see that errors are reported even if the cudaDeviceSetLimit() line is commented out.
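Specifically (my reading of the posted code), the indexing in your kernels goes out of bounds on the first loop iteration:

for (int i = 0; i < loop_num; ++i) {
  // At i == 0, (i - 1) % REG_ARR_LN is (-1) % 2048 == -1 in C++
  // (truncating division), so this reads spilled[-1]: out of bounds.
  spilled[i] = spilled[(i - 1) % REG_ARR_LN];
}

Starting the loop at i = 1 (after initializing spilled[0]) would avoid that.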
Yep, noticed that; that’s what I get for trying to respond quickly. I missed the cudaDeviceSynchronize() after the kernel call that is needed to successfully register the error that happened there.
So nevermind, the cudaDeviceSetLimit api indeed solves the problem for me. Thank you!
FWIW I don’t think the cudaDeviceSynchronize() after the kernel call is strictly necessary to make the cudaDeviceSetLimit() call have proper behavior. I should have probably not mentioned that.
Yes, I meant that I missed the cudaDeviceSynchronize() necessary between kernel<<<...>>>() and cudaGetLastError(), which meant I wasn’t successfully catching the error in the kernel and blamed it on cudaDeviceSetLimit().
Glad that you found a solution with cudaDeviceSetLimit.
But just for completeness: as your kernel is not performance critical, you could have always used an externally allocated global memory array instead of relying on the feature that local arrays are sometimes placed in registers. The L1 cache would also have helped. Also for completeness, you could have marked the memory as no longer needed at the end of your kernel so that the cache is not written back.
What will actually make the most sense for my application depends on a number of other factors. Really, what I have now is a kernel with suboptimal but acceptable performance using a large local array; if I do some optimization down the road, it should be possible to substantially reduce the size of that array, to the point where it could probably fit entirely in registers. I was hoping to avoid implementing something different, like the global memory solution you suggest, only to serve as a stand-in until I do that optimization. But really, I just wanted to know what was going on with this persistent local memory allocation, and I am happy to find there is indeed a way to free it.