When allocating buffers on the GPU, there is a noticeable difference between the requested size of a buffer and the actual reduction in free memory reported by cudaMemGetInfo. On my system, every allocation reduces the free memory reported by cudaMemGetInfo by a multiple of 2^21 bytes (2 MiB). For small buffers, subsequent allocations often consume no additional memory, apparently because they still fit into the 2 MiB block of the previous allocation. For buffers larger than 2^20 bytes (1 MiB), however, each allocation consumes its own multiple of 2^21 bytes, and that multiple can be more than twice the requested buffer size, although the relative overhead appears to shrink for larger buffers. I’m observing this behavior for both cudaMalloc and cudaMallocArray.
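For illustration, here is a minimal sketch of how the consumption can be measured (the `measureConsumption` helper and the size sweep are mine, for demonstration only, not part of any documented API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: allocate a buffer of the requested size and report
// how much the free memory drops according to cudaMemGetInfo.
static size_t measureConsumption(size_t requestedBytes)
{
    size_t freeBefore = 0, freeAfter = 0, total = 0;
    cudaMemGetInfo(&freeBefore, &total);

    void* ptr = nullptr;
    cudaMalloc(&ptr, requestedBytes);
    cudaMemGetInfo(&freeAfter, &total);

    cudaFree(ptr);
    return freeBefore - freeAfter;  // on my system, always a multiple of 2 MiB
}

int main()
{
    // Sweep a few sizes; the observed consumption is the requested size
    // rounded up (and sometimes more) to a multiple of 2^21 bytes.
    for (size_t kib = 256; kib <= 8 * 1024; kib *= 2) {
        size_t requested = kib * 1024;
        std::printf("requested %10zu B -> consumed %10zu B\n",
                    requested, measureConsumption(requested));
    }
    return 0;
}
```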
In order to subdivide work items into processable chunks, I need to be able to predict how much memory a buffer of a given size will effectively consume. In a realistic use case, a large number of CUDA arrays (roughly 900x700 elements each) consumed more than 1.5x the requested amount of memory, causing out-of-memory errors.
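The naive prediction would be to round the requested size up to the observed 2 MiB granularity. A sketch under that assumption is below (the constant and the `estimateConsumption` helper are empirical guesses of mine, not documented behavior), but it underestimates what I actually see:

```cuda
#include <cstddef>

// Empirical constant: the 2^21-byte (2 MiB) granularity I observe via
// cudaMemGetInfo on my system; the CUDA API does not document or guarantee it.
constexpr size_t kObservedGranularity = size_t(1) << 21;

// Hypothetical estimator: round the requested size up to the observed
// granularity. In practice this underestimates the real consumption, since
// the consumed multiple of 2 MiB can exceed twice the requested size
// (e.g. the ~900x700 CUDA arrays used more than 1.5x what was requested).
constexpr size_t estimateConsumption(size_t requestedBytes)
{
    return (requestedBytes + kObservedGranularity - 1)
           / kObservedGranularity * kObservedGranularity;
}
```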
Is there a way to determine in advance how much memory a buffer will effectively consume once allocated, either via a formula or by querying the CUDA API?