Out of memory when allocating local memory


I am trying to allocate a large local memory array inside a kernel, but the program reports out of memory at run time. Although the size (210816 bytes) is quite large, I think it is within the limit of local memory per thread. I am only running with 1 thread, so the total size should be acceptable as well. Besides, there are no other processes running on the same device.

To reproduce, you can use the following code:

#include <cstdio>
#include <cstdlib>

#define checkCudaError(call)                                              \
    {                                                                     \
        auto err = (call);                                                \
        if (cudaSuccess != err) {                                         \
            fprintf(stderr, "CUDA error in file '%s' in line %i : %s.\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err));         \
            exit(-1);                                                     \
        }                                                                 \
    }

__global__ void kernel() { float arr[52704]; }

int main() {
    kernel<<<1, 1>>>();
    checkCudaError(cudaGetLastError());
    checkCudaError(cudaDeviceSynchronize());
    return 0;
}

Compile it with nvcc test.cu -o test --resource-usage -O0 -G -g -arch=sm_70 (note: -O0 -G prevents the compiler from optimizing out the array). Compiling with CUDA 11.6 and running on a V100-SXM2-32GB results in an out-of-memory error.

The total size of the array is 210816 bytes (which is consistent with what nvcc --resource-usage reports), and it is lower than the limit of max local memory per thread, which is 512KB (CUDA C++ Programming Guide).

Are there any other limitations on local memory size, or is it a bug? Looking forward to any useful information.

Yes, there is another limit on local memory (and, relatedly, on stack size, since the stack lives in the logical local memory space of each thread). njuffa has described it here

I think if you run through that math for your V100 GPU, you will find the problem. The calculation will show that your 210816 byte per thread request requires 34540093440 bytes when considered device-wide (for V100 device), and that exceeds the 32GB available on your GPU. (Anticipating: No, the launch configuration <<<1,1>>> is not considered in this analysis.)

You could simply allocate the array outside of the kernel. Unless the array is accessed only with compile-time-constant indices, a local array of this size will end up in (off-chip) global memory anyway.
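A minimal sketch of that suggestion (the names are illustrative): allocate one 52704-float slice per thread in global memory and index by the global thread ID, so the per-thread local memory reservation never comes into play.

```cuda
#include <cstdio>

__global__ void kernel(float *buf) {
    // Each thread works on its own 52704-float slice of the global buffer.
    float *arr = buf + (size_t)(blockIdx.x * blockDim.x + threadIdx.x) * 52704;
    arr[0] = 1.0f;
}

int main() {
    float *buf;
    // One thread total here; scale the allocation by the total thread
    // count for larger launches.
    if (cudaMalloc(&buf, 52704 * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return -1;
    }
    kernel<<<1, 1>>>(buf);
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```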

You could also use in-kernel dynamic memory allocation. For the <<<1,1>>> case, you would not run out of memory.
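A sketch of the in-kernel approach: device-side malloc draws from the device heap (8 MB by default, adjustable with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)), so the per-thread local memory reservation does not apply.

```cuda
#include <cstdlib>

__global__ void kernel() {
    // Allocated from the device heap, not from per-thread local memory,
    // so neither the 512 KB/thread local limit nor the device-wide
    // local memory reservation is involved.
    float *arr = (float *)malloc(52704 * sizeof(float));
    if (arr == nullptr) return;  // device heap exhausted
    arr[0] = 1.0f;
    free(arr);
}
```

Note that device-side malloc returns nullptr on failure rather than raising an error, so the in-kernel check is required.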
