Hi,
I am trying to use a large local memory array inside a kernel, but it reports an out-of-memory error at run time. Although the size (210816 bytes) is quite large, I think it is within the limit of local memory per thread. I am only running with 1 thread, so the total size should be acceptable as well. Besides, there are no other processes running on the same device.
To reproduce, you can use the following code:
#include <cstdio>
#include <cstdlib>

#define checkCudaError(call) \
  { \
    auto err = (call); \
    if (cudaSuccess != err) { \
      fprintf(stderr, "CUDA error in file '%s' in line %i : %s.\n", \
              __FILE__, __LINE__, cudaGetErrorString(err)); \
      exit(-1); \
    } \
  }

__global__ void kernel() { float arr[52704]; }

int main() {
  kernel<<<1, 1>>>();
  checkCudaError(cudaGetLastError());
  return 0;
}
Compile it with nvcc test.cu -o test --resource-usage -O0 -G -g -arch=sm_70 (note: -O0 -G prevents the compiler from optimizing out the array). Compiling with CUDA 11.6 and running on a V100-SXM2-32GB results in an out-of-memory error.
The total size of the array is 52704 floats × 4 bytes = 210816 bytes (consistent with what nvcc --resource-usage reports), which is well below the 512 KB limit of local memory per thread documented in the CUDA C++ Programming Guide.
Are there any other limitations on local memory size, or is this a bug? Looking forward to any useful information.