Out of memory when allocating local memory


I am trying to allocate a large local memory array inside a kernel, but the program reports out of memory at run time. Although the size (210816 bytes) is quite large, I think it is within the limit of local memory per thread. I am only running with 1 thread, so the total size should be acceptable as well. Besides, there are no other processes running on the same device.

To reproduce, you can use the following code:

#include <cstdio>
#include <cstdlib>

#define checkCudaError(call)                                              \
    {                                                                     \
        auto err = (call);                                                \
        if (cudaSuccess != err) {                                         \
            fprintf(stderr, "CUDA error in file '%s' in line %i : %s.\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err));         \
            exit(-1);                                                     \
        }                                                                 \
    }

__global__ void kernel() { float arr[52704]; }

int main() {
    kernel<<<1, 1>>>();
    checkCudaError(cudaGetLastError());
    checkCudaError(cudaDeviceSynchronize());
    return 0;
}

Compile it with nvcc test.cu -o test --resource-usage -O0 -G -g -arch=sm_70 (note: -O0 -G prevents the compiler from optimizing out the array). Compiling with CUDA 11.6 and running on a V100-SXM2-32GB results in an out-of-memory error.

The total size of the array is 210816 bytes (which is consistent with what nvcc --resource-usage reports), and it is lower than the limit of max local memory per thread, which is 512KB (CUDA C++ Programming Guide).

Are there any other limitations on local memory size, or is it a bug? Looking forward to any useful information.

Yes, there is another limit on local memory (and, relatedly, on stack size, since the stack lives in the logical local memory space of each thread). njuffa has described it here

I think if you run through that math for your V100 GPU, you will find the problem. The calculation will show that your 210816 byte per thread request requires 34540093440 bytes when considered device-wide (for V100 device), and that exceeds the 32GB available on your GPU. (Anticipating: No, the launch configuration <<<1,1>>> is not considered in this analysis.)

You could simply allocate the array outside of the kernel. Unless the array is accessed only with compile-time-constant indices, a local array of this size will end up in (off-chip) global memory anyway.
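A minimal sketch of that suggestion (the names are illustrative): allocate one 52704-float slice per thread in global memory and index by the global thread ID, so the per-thread local memory reservation never comes into play.

```cuda
#include <cstdio>

__global__ void kernel(float *buf) {
    // Each thread works on its own 52704-float slice of the global buffer.
    float *arr = buf + (size_t)(blockIdx.x * blockDim.x + threadIdx.x) * 52704;
    arr[0] = 1.0f;
}

int main() {
    float *buf;
    // One thread total here; scale the allocation by the total thread
    // count for larger launches.
    if (cudaMalloc(&buf, 52704 * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return -1;
    }
    kernel<<<1, 1>>>(buf);
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```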

You could also use in-kernel dynamic memory allocation. For the <<<1,1>>> case, you would not run out of memory.
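A sketch of the in-kernel approach: device-side malloc draws from the device heap (8 MB by default, adjustable with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)), so the per-thread local memory reservation does not apply.

```cuda
#include <cstdlib>

__global__ void kernel() {
    // Allocated from the device heap, not from per-thread local memory,
    // so neither the 512 KB/thread local limit nor the device-wide
    // local memory reservation is involved.
    float *arr = (float *)malloc(52704 * sizeof(float));
    if (arr == nullptr) return;  // device heap exhausted
    arr[0] = 1.0f;
    free(arr);
}
```

Note that device-side malloc returns nullptr on failure rather than raising an error, so the in-kernel check is required.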
