To launch a kernel, the CUDA driver must allocate a number of device buffers. The driver attempts to defer allocation of many of the global buffers until they are first used. For example, the driver will not allocate the device heap buffer until a kernel containing malloc/new is about to launch, and it defers allocation or resizing of the printf buffer until it sees a launch request that uses the printf syscall.
The driver has to allocate and manage the following device memory buffers per CUDA context (CUcontext in the driver API):
A. local memory allocation
B. constant buffers
C. printf buffer allocation
D. device memory heap allocation
The local memory allocation can be estimated by querying a few values.
The per-thread requirement for the launch can be queried using cudaFuncGetAttributes() and reading the localSizeBytes field.
The device-wide values can be queried using cudaGetDeviceProperties (or cudaDeviceGetAttribute).
The allocation size can be estimated using the following formula:
lmemDeviceAllocation = ROUNDUP((localSizeBytes + 512 (syscall stack)), lmemRoundUpSize) * cudaDeviceProp.maxThreadsPerMultiprocessor * cudaDeviceProp.multiProcessorCount
where lmemRoundUpSize is 128 or 256 bytes.
There is only one local memory allocation per CUcontext or CUDA runtime device. This is not a per launch allocation. If a launch requires additional local memory then the driver has to synchronize the device and re-allocate the local memory allocation. The allocation size is based upon the maximum resident threads on the device and is independent of the launch grid and block dimensions.
The function cudaSetDeviceFlags(cudaDeviceLmemResizeToMax) can help reduce the number of stalls due to buffer resizing.
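A minimal usage sketch (requires the CUDA runtime and a device; the flag must be set before the context is created, i.e. before any other runtime call that touches the device):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Must be called before the CUDA context is created for this device,
    // i.e. before any kernel launch or memory allocation.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceLmemResizeToMax);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Launches after this point keep the largest local memory allocation
    // seen so far instead of shrinking it, avoiding repeated resize stalls.
    return 0;
}
```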
The constant buffers are allocated (a) per stream, up to a maximum count after which the driver shares constant buffers among streams, and (b) per CUmodule by the compiler. The per-stream allocation is used to pass the launch parameters, texture headers, sampler headers, and other per-launch data. The module constant banks are created by the compiler and depend on the number of explicit device constant declarations and the implicit constants used in device code that the compiler allocates into a constant bank.
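To illustrate the distinction, a hypothetical kernel showing both kinds of constants the compiler places into constant banks:

```cuda
// Explicit device constant declaration: the programmer asks for a
// constant-bank entry directly.
__constant__ float coeffs[16];

__global__ void scale(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The literal 0.5f (and the kernel parameters themselves) are examples
    // of implicit constants the compiler may also place in constant banks.
    if (i < n) out[i] = in[i] * coeffs[i % 16] * 0.5f;
}
```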
The printf buffer size can be queried using cudaDeviceGetLimit(&size, cudaLimitPrintfFifoSize).
The device memory heap size can be queried using cudaDeviceGetLimit(&size, cudaLimitMallocHeapSize).
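Both limits can be queried (and raised up front with cudaDeviceSetLimit, which avoids a resize stall at launch time). A sketch, requiring the CUDA runtime and a device:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t printfFifo = 0, mallocHeap = 0;
    cudaDeviceGetLimit(&printfFifo, cudaLimitPrintfFifoSize);
    cudaDeviceGetLimit(&mallocHeap, cudaLimitMallocHeapSize);
    printf("printf FIFO: %zu bytes, malloc heap: %zu bytes\n",
           printfFifo, mallocHeap);

    // Raising a limit before the first launch that needs it avoids a
    // later device synchronization and reallocation. 64 MiB here is an
    // arbitrary example value.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64u << 20);
    return 0;
}
```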
The various CUDA profilers can capture and display a good portion of this information.