Memory required for kernel launch

How can I predict the memory that a kernel requires to be launched? I’m assuming that at kernel launch the runtime allocates local and constant memory based on the number of threads, and that if that allocation fails it returns CUDA_ERROR_OUT_OF_MEMORY.

For estimating the required memory, do I need to look at all entry points in a module? If I just take the initial entry point’s local memory, multiply it by the number of threads, and add the size of constant memory, I get a number that is much smaller than what the runtime is actually using.

In order to launch a kernel the CUDA driver must allocate a number of different device buffers. The driver attempts to defer allocation of many of the global buffers until they are used. For example, the driver will not allocate the device heap buffer until a kernel containing malloc/new is about to launch, and it will defer allocation or resizing of the printf buffer until it sees a launch request that uses the printf syscall.

The driver has to allocate and manage the following device memory buffers per context (CUcontext in the driver API):

A. local memory allocation
B. constant buffers
C. printf buffer allocation
D. device memory heap allocation

The local memory allocation can be estimated by querying a number of variables.

The per-thread requirement for the launch can be queried using cudaFuncGetAttributes() and reading localSizeBytes.
The device properties can be queried using cudaGetDeviceProperties() (or cudaDeviceGetAttribute()).

The estimated size can be computed using the following formula

lmemDeviceAllocation = ROUNDUP(localSizeBytes + 512 (syscall stack), lmemRoundUpSize) * cudaDeviceProp.maxThreadsPerMultiProcessor * cudaDeviceProp.multiProcessorCount

where lmemRoundUpSize is 128 or 256 bytes.
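
A minimal sketch of computing that estimate with the runtime API, assuming the 512-byte syscall stack and a 256-byte lmemRoundUpSize from the formula above (neither value is exposed by the API, so treat both as assumptions):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }   // placeholder kernel for the example

// Estimate the context-wide local memory allocation the driver would make for
// this kernel: round up the per-thread footprint, then multiply by the maximum
// number of resident threads on the device.
size_t estimateLmemAllocation(const void* kernel, int device,
                              size_t syscallStack = 512,     // assumed value
                              size_t lmemRoundUpSize = 256)  // assumed value
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernel);     // per-thread local memory (localSizeBytes)

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);   // SM count, max resident threads per SM

    size_t perThread = attr.localSizeBytes + syscallStack;
    perThread = ((perThread + lmemRoundUpSize - 1) / lmemRoundUpSize) * lmemRoundUpSize;

    return perThread * prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;
}

int main()
{
    size_t est = estimateLmemAllocation((const void*)myKernel, 0);
    printf("estimated local memory allocation: %zu bytes\n", est);
    return 0;
}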

There is only one local memory allocation per CUcontext (per CUDA runtime device). It is not a per-launch allocation. If a launch requires more local memory than is currently allocated, the driver has to synchronize the device and re-allocate the local memory buffer. The allocation size is based on the maximum number of resident threads on the device and is independent of the launch grid and block dimensions.

Calling cudaSetDeviceFlags(cudaDeviceLmemResizeToMax) can help reduce the number of stalls caused by buffer resizing.
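
As a sketch, the flag is set before the context is created (assumption: device 0, and nothing else has touched the device yet):

#include <cuda_runtime.h>

int main()
{
    // Ask the runtime not to shrink the local memory allocation after a
    // launch forces it to grow, so later large-lmem launches don't trigger
    // another synchronize-and-resize.
    cudaSetDeviceFlags(cudaDeviceLmemResizeToMax);
    cudaSetDevice(0);

    // ... cudaMalloc and kernel launches as usual ...
    return 0;
}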

The constant buffers are allocated (a) per stream (up to a maximum number, after which the driver shares constant buffers among streams), and (b) per CUmodule by the compiler. The per-stream allocation is used to pass the launch parameters, texture headers, sampler headers, and other per-launch data. The module constant banks are created by the compiler and depend on the number of explicit __constant__ declarations and on implicit constants in device code that the compiler allocates into a constant bank, as sketched below.
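
A small sketch of the two kinds of module constants (the name coeffs is hypothetical):

#include <cuda_runtime.h>

__constant__ float coeffs[256];   // explicit device constant declaration -> module constant bank

__global__ void scale(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i % 256] * 0.5f;  // the literal 0.5f may become an implicit constant
}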

The printf buffer size can be queried using cudaDeviceGetLimit() with cudaLimitPrintfFifoSize.

The device memory heap size can be queried using cudaDeviceGetLimit() with cudaLimitMallocHeapSize.
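
A sketch of querying (and optionally lowering) both limits before the first launch that needs them (the 1 MB heap value is just an example):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t printfFifo = 0, mallocHeap = 0;
    cudaDeviceGetLimit(&printfFifo, cudaLimitPrintfFifoSize);   // printf buffer size
    cudaDeviceGetLimit(&mallocHeap, cudaLimitMallocHeapSize);   // device malloc/new heap size
    printf("printf FIFO: %zu bytes, device heap: %zu bytes\n", printfFifo, mallocHeap);

    // Both limits can be changed before the first dependent launch, e.g. to
    // shrink the deferred heap allocation (example value only).
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1 << 20);
    return 0;
}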

The various CUDA profilers can capture and display a good portion of this information.

Thank you, that was a very detailed and helpful answer.

I need to use this answer to compute kernel memory so I can leave enough headroom after my cudaMalloc’s to be able to run my kernel (out-of-memory errors at runtime are a bummer). Can someone clarify the “512 (syscall stack)” part of the equation? Where can I find the “syscall stack” value (or bound it), and is it 512 * (syscall stack), or is 512 bytes the syscall stack itself (I’m unsure due to the mixed notation)?

Additionally (and because it’s especially pertinent to this question): is there a way to reserve local memory for fewer than maxThreadsPerMultiProcessor threads? My kernel will usually be launched with N*32 threads per block (where N is in the low single digits), and having to set aside local memory for 2048 resident threads wastes a considerable amount of memory.