How do we accurately determine the amount of available global memory on a CUDA device?
Our problem is this: we have an application that is limited by the total amount of available global memory, so we must execute the same kernel multiple times against successive chunks of data. The memory allocations for the kernel are not homogeneous: they consist of four arrays (of element types char, ushort2, and uint4), and the size of each array changes each time the kernel is executed.
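For concreteness, the per-chunk allocations look roughly like the sketch below. The function name and element counts are hypothetical, and since three element types cover four arrays, we assume here that char is the type that repeats:

```cpp
#include <cuda_runtime.h>   // also pulls in ushort2/uint4 from vector_types.h

// Hypothetical shape of one chunk's allocations; n0..n3 change on every
// kernel invocation, so the buffers must be resized/reallocated each time.
void allocChunk(size_t n0, size_t n1, size_t n2, size_t n3,
                char **a, ushort2 **b, uint4 **c, char **d) {
    cudaMalloc((void **)a, n0 * sizeof(char));
    cudaMalloc((void **)b, n1 * sizeof(ushort2));
    cudaMalloc((void **)c, n2 * sizeof(uint4));
    cudaMalloc((void **)d, n3 * sizeof(char));  // assumption: char repeats
}
```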
We obviously want to maximize the amount of memory used by the arrays each time the kernel is executed. This means we need an accurate estimate of the amount of free global memory. To get that estimate, we call cuMemGetInfo.
Unfortunately, the amount of free memory reported by cuMemGetInfo does not match what we expect based on the sizes of the arrays we allocate. For example, if we call cuMemGetInfo to determine the amount of free global memory, use cudaMalloc to allocate an array of 1,000,000 bytes, and then call cuMemGetInfo again, the reported free memory drops by well over 1,000,000 bytes.
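A minimal, self-contained way to reproduce this, using the runtime-API counterpart cudaMemGetInfo (which reports the same numbers as cuMemGetInfo but needs no explicit driver-API context setup):

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeBefore, freeAfter, total;

    cudaFree(0);                              // force context creation up front
    cudaMemGetInfo(&freeBefore, &total);

    void *d_buf = nullptr;
    const size_t request = 1000000;           // the 1,000,000-byte example above
    cudaMalloc(&d_buf, request);

    cudaMemGetInfo(&freeAfter, &total);

    // "consumed" comes out larger than "request".
    size_t consumed = freeBefore - freeAfter;
    printf("requested %zu bytes, free memory dropped by %zu bytes (overhead %zd)\n",
           request, consumed, (ptrdiff_t)(consumed - request));

    cudaFree(d_buf);
    return 0;
}
```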
Here is some “real” data for four consecutive calls to cudaMalloc (one per array), taken on a GTX 480 with driver v263.06 and 1576599552 bytes of total memory. The “expected/actual” values are the amount of free global memory we expect to be available and the amount actually reported by cuMemGetInfo; “diff” is actual minus expected:
06:04.524 callbackPreKernel for device 0: 1378340864 bytes free
06:05.448 callbackPreKernel for device 0: allocated 3674688 bytes: expected/actual bytes free = 1374666176/1374539776 (diff=-126400)
06:06.390 callbackPreKernel for device 0: allocated 153112 bytes: expected/actual bytes free = 1374386664/1373491200 (diff=-895464)
06:07.282 callbackPreKernel for device 0: allocated 313573376 bytes: expected/actual bytes free = 1059917824/1059835904 (diff=-81920)
06:08.185 callbackPreKernel for device 0: allocated 14698752 bytes: expected/actual bytes free = 1045137152/1045024768 (diff=-112384)
To begin with, there are 1576599552-1378340864 = 198258688 bytes of memory used by something other than our app. Since the device has no monitor attached, can you tell us what’s consuming that memory?
Then, each cudaMalloc call consumes an additional, variable amount of memory beyond what we requested. What is that extra memory? If there is some kind of internal per-allocation overhead on the device, how do we quantify it?
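Lacking documentation for this overhead, the best idea we have is to probe it empirically: allocate a range of sizes and watch how much free memory each one really costs, on the assumption that the overhead is dominated by some fixed allocation granularity. A sketch (the helper name measureCost is ours):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// How much free memory does one cudaMalloc of 'bytes' really cost?
static size_t measureCost(size_t bytes) {
    size_t freeBefore, freeAfter, total;
    void *p = nullptr;
    cudaMemGetInfo(&freeBefore, &total);
    cudaMalloc(&p, bytes);
    cudaMemGetInfo(&freeAfter, &total);
    cudaFree(p);
    return freeBefore - freeAfter;
}

int main() {
    // Warm-up allocation absorbs one-time context/heap setup so it does
    // not pollute the first measurement.
    void *warmup = nullptr;
    cudaMalloc(&warmup, 1);
    cudaFree(warmup);

    for (size_t bytes = 1; bytes <= (size_t)1 << 24; bytes <<= 2)
        printf("requested %10zu -> consumed %10zu\n", bytes, measureCost(bytes));
    return 0;
}
```

If the consumed sizes cluster on multiples of some constant, that constant is the effective allocation granularity.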
For now, we have worked around the problem in the crudest way possible: we simply subtract an additional 150 or 200 megabytes from the value reported by cuMemGetInfo. But I suspect there must be a better way!
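In code, the workaround amounts to something like the sketch below; the 200 MB margin and the 5% back-off step are arbitrary values for illustration, not recommendations:

```cpp
#include <cuda_runtime.h>

// Crude sizing: treat cuMemGetInfo's number as an optimistic upper bound,
// reserve a fixed safety margin, and shrink the request until cudaMalloc
// succeeds. Returns the buffer and reports the size actually granted.
static void *allocLargest(size_t *granted) {
    const size_t margin = (size_t)200 << 20;  // 200 MB safety margin (arbitrary)
    size_t freeBytes, total;
    cudaMemGetInfo(&freeBytes, &total);

    size_t request = (freeBytes > margin) ? freeBytes - margin : 0;
    void *p = nullptr;
    while (request > 0 && cudaMalloc(&p, request) != cudaSuccess) {
        cudaGetLastError();                   // clear the allocation error
        p = nullptr;
        request = request * 95 / 100;         // shrink by 5% and retry (arbitrary)
    }
    *granted = p ? request : 0;
    return p;
}
```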