Accurately determining available global memory on a CUDA device

How do we accurately determine the amount of available global memory on a CUDA device?

Our problem is: We have an application that is limited by the total amount of available global memory, so we need to execute the same kernel multiple times against successive chunks of data. The memory allocations for the kernel are not homogeneous. They consist of four different arrays (of char, ushort2, and uint4), and the size of each array changes each time the kernel is executed.

We obviously want to maximize the amount of memory used by the arrays each time the kernel is executed. This means we need an accurate estimate of the amount of free global memory. To get that estimate, we call cuMemGetInfo.

Unfortunately, the amount of memory reported by cuMemGetInfo does not match the amount we expect to see based on the sizes of the arrays we allocate. For example, if we call cuMemGetInfo to determine the amount of free global memory, use cudaMalloc to allocate an array whose size is 1,000,000 bytes, and then call cuMemGetInfo again, we find that well over 1,000,000 bytes of memory were actually allocated.
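
Here is roughly what that measurement looks like, as a minimal sketch using cudaMemGetInfo (the runtime-API counterpart of cuMemGetInfo); the 1,000,000-byte request is just the example size mentioned above:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t request = 1000000;       // the example request size from above
        size_t freeBefore = 0, freeAfter = 0, total = 0;
        void *ptr = NULL;

        cudaFree(0);                          // force context creation before measuring
        cudaMemGetInfo(&freeBefore, &total);

        cudaMalloc(&ptr, request);
        cudaMemGetInfo(&freeAfter, &total);

        printf("requested %zu bytes, free memory dropped by %zu bytes (overhead %zu)\n",
               request, freeBefore - freeAfter, (freeBefore - freeAfter) - request);

        cudaFree(ptr);
        return 0;
    }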

Here is some “real” data for four consecutive calls to cudaMalloc (one for each array) taken on a GTX480 with driver v263.06 and with 1576599552 bytes of memory. The “expected/actual” values represent the amount of free global memory we expect to be available and the amount actually reported by cuMemGetInfo; the “diff” value is the difference between “expected” and “actual”:

06:04.524 callbackPreKernel for device 0: 1378340864 bytes free
06:05.448 callbackPreKernel for device 0: allocated 3674688 bytes: expected/actual bytes free = 1374666176/1374539776 (diff=-126400)
06:06.390 callbackPreKernel for device 0: allocated 153112 bytes: expected/actual bytes free = 1374386664/1373491200 (diff=-895464)
06:07.282 callbackPreKernel for device 0: allocated 313573376 bytes: expected/actual bytes free = 1059917824/1059835904 (diff=-81920)
06:08.185 callbackPreKernel for device 0: allocated 14698752 bytes: expected/actual bytes free = 1045137152/1045024768 (diff=-112384)

To begin with, there are 1576599552-1378340864 = 198258688 bytes of memory used by something other than our app. Since the device has no monitor attached, can you tell us what’s consuming that memory?

Then, as each cudaMalloc call executes, there is an additional, variable amount of extra memory being consumed – but what is it? If there is some kind of internal overhead for global memory allocation on the device, how do we quantify it?

For now, we have worked around the problem in the dumbest way possible: we simply subtract an additional 150 or 200 megabytes from the value reported by cuMemGetInfo. But I suspect there must be a better way!

The “diff” values you are seeing are most likely due to page sizes. When you make an allocation, the driver gives you a whole number of pages sufficient to hold your request. For large requests the page size is large (1 MB, IIRC), so you can have anything up to almost a megabyte of “wasted” memory per allocation, depending on your request size modulo the page size the driver is using.
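
If you want to budget for that, a rough sketch is to round every request up to an assumed page size before summing (the 1 MB figure is a guess, not a documented value):

    #include <cstddef>

    // Assumed allocation granularity -- a guess, not a documented value.
    static const size_t kAssumedPageSize = (size_t)1 << 20;   // 1 MB

    // Estimate what a request will really cost by rounding up to whole pages.
    size_t paddedSize(size_t requested)
    {
        return ((requested + kAssumedPageSize - 1) / kAssumedPageSize) * kAssumedPageSize;
    }

Summing paddedSize() over all planned allocations should track what cuMemGetInfo reports more closely than summing the raw request sizes, though the real granularity can differ between driver versions and may not even be a single fixed value.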

As for where the “rest” of the memory goes, CUDA contexts use space for buffers, state, local memory, constant memory, program code, etc. How much is used will depend a bit on your hardware and OS, but in my experience it is typical for CUDA to take 50-100 MB of free space on a non-display GPU.
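
You can measure that fixed cost on your own setup by comparing total memory with what is reported free right after the context comes up, before any of your own allocations; a quick sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        size_t freeBytes = 0, totalBytes = 0;

        cudaFree(0);                              // force creation of the CUDA context
        cudaMemGetInfo(&freeBytes, &totalBytes);

        // Whatever is missing here is taken by the context, the driver,
        // and anything else using the GPU.
        printf("startup overhead: %zu bytes\n", totalBytes - freeBytes);
        return 0;
    }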

My advice is to look after memory allocation yourselves. At the beginning of your code, allocate as much memory as cuMemGetInfo tells you is free, then use your own memory-management code. You cut away a lot of API overhead, and you can have word-level control over memory allocation if you want it.
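
A minimal sketch of that idea: grab one big block up front and hand out aligned sub-ranges from it yourself (the names, the reserve parameter, and the 256-byte alignment are just illustrative choices):

    #include <cstddef>
    #include <cuda_runtime.h>

    // Tiny bump-pointer sub-allocator over one big device allocation.
    struct DevicePool {
        char  *base;     // start of the big device block
        size_t size;     // total bytes in the pool
        size_t offset;   // next free byte

        // Grab (free - reserve) bytes in a single cudaMalloc call.
        bool init(size_t reserve)
        {
            size_t freeBytes = 0, totalBytes = 0;
            cudaMemGetInfo(&freeBytes, &totalBytes);
            if (freeBytes <= reserve) return false;
            size   = freeBytes - reserve;
            offset = 0;
            return cudaMalloc((void **)&base, size) == cudaSuccess;
        }

        // Hand out a sub-range, aligned to 256 bytes (an illustrative choice).
        void *alloc(size_t bytes)
        {
            size_t aligned = (bytes + 255) & ~(size_t)255;
            if (offset + aligned > size) return NULL;
            void *p = base + offset;
            offset += aligned;
            return p;
        }

        void reset()   { offset = 0; }       // reuse the whole pool for the next chunk
        void destroy() { cudaFree(base); base = NULL; }
    };

Calling reset() between kernel launches lets you re-carve the same block for the next set of array sizes without ever going back to cudaMalloc.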

Hi,
I just found this thread, and I'm in exactly the same situation:
I need to determine the amount of available and required memory for my allocations as exactly as possible (ideally, exact numbers at compile time).

I’m working on C2070s.
It seems as if 104520 kB are in use right from the start.
How can I estimate this number? Is it a fixed value or does it vary depending on the rest of the code?

I also noticed that allocating memory requires more memory than expected.
Here again, it would be very interesting to know how much memory I should expect my data chunks to require.

My application is roughly as follows:
A giant piece of 2D data, processed in chunks of multiple rows in parallel (a rough sketch in code follows the list):

  • get memory for $some$ rows
    – copy a chunk of data
    – process rows
    – read back results
    – copy the next chunk
    – process rows
    – read back results
    – and so on.
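
In code the loop is roughly the following (the kernel body, the float element type, and the 200 MB reserve are placeholders, not my actual code):

    #include <algorithm>
    #include <cuda_runtime.h>

    // Placeholder kernel: stands in for the real per-row processing.
    __global__ void processRows(float *d_rows, int rows, int cols)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r < rows)
            for (int c = 0; c < cols; ++c)
                d_rows[(size_t)r * cols + c] *= 2.0f;   // dummy operation
    }

    void processAll(const float *h_data, float *h_result, int totalRows, int cols)
    {
        // Decide how many rows fit per chunk, keeping a margin for driver overhead.
        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);

        size_t rowBytes = (size_t)cols * sizeof(float);
        size_t reserve  = (size_t)200 << 20;                   // 200 MB margin (a guess)
        if (freeBytes <= reserve) return;                      // not enough memory at all
        int rowsPerChunk = (int)((freeBytes - reserve) / rowBytes);

        float *d_rows = 0;
        cudaMalloc((void **)&d_rows, (size_t)rowsPerChunk * rowBytes);

        for (int row = 0; row < totalRows; row += rowsPerChunk) {
            int rows = std::min(rowsPerChunk, totalRows - row);

            // copy a chunk of data
            cudaMemcpy(d_rows, h_data + (size_t)row * cols, rows * rowBytes,
                       cudaMemcpyHostToDevice);

            // process rows
            processRows<<<(rows + 255) / 256, 256>>>(d_rows, rows, cols);

            // read back results
            cudaMemcpy(h_result + (size_t)row * cols, d_rows, rows * rowBytes,
                       cudaMemcpyDeviceToHost);
        }
        cudaFree(d_rows);
    }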

The “fun part” is that the dimensions of the processed 2D data vary, but are known at compile time.
So it can be 50,000 x 110,000 just as well as 1,000,000 x 3,000.

Can someone please point me to where I can find those details, or give me an appropriate rule of thumb with a reasonable safety margin that I can use?
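
To make it concrete, the kind of rule of thumb I have in mind looks like the sketch below, where the 150 MB reserve and the 1 MB per-allocation padding are pure guesses on my part rather than documented values:

    #include <cuda_runtime.h>

    // Pure guesses, to be tuned: a flat reserve for context/driver overhead
    // and an assumed per-allocation granularity.
    static const size_t kReserve   = (size_t)150 << 20;   // 150 MB
    static const size_t kPageGuess = (size_t)1   << 20;   // 1 MB

    // How many rows of the 2D data fit in one chunk, given the bytes needed
    // per row and the number of separate device arrays per chunk.
    size_t rowsPerChunk(size_t bytesPerRow, int numArrays)
    {
        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);

        // Each array may cost up to one extra page due to rounding, so budget for that too.
        size_t overhead = kReserve + (size_t)numArrays * kPageGuess;
        if (freeBytes <= overhead) return 0;
        return (freeBytes - overhead) / bytesPerRow;
    }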

Markus