Device Memory Management

I may have asked about this before, but since I saw another semi-related post this morning, I figured I’d ask (again)…

The CUDA interface provides a method to see how much total/free memory is available on the device, but I would like to know if there is a way to list all currently allocated memory blocks (their starting addresses and lengths). The reason I ask: there might be enough total free memory on the device to hold your data, but if other allocations already exist, that free space could be fragmented, and you might be unable to actually store your data on the card.

Since I know someone will suggest just keeping track of whatever I allocate, I’m specifically curious about this because if there are multiple CUDA applications running (switching contexts, etc.) they may not “know” where each other’s memory is already allocated.

Ehh, doesn’t just trying to allocate and checking for an error return give you the same information?

I do not think what you ask for is currently available, and I cannot really see what more use it would be than simply trying to allocate your memory (preferably biggest array first, smallest array last) and checking for errors. This is especially true with multiple applications: in the time between checking whether enough memory is available and actually allocating it, another context might have ‘stolen’ the memory you wanted.
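For what it’s worth, a minimal sketch of that allocate-and-check approach with the runtime API might look like this (the function and buffer names here are just made up for illustration):

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Try to grab the buffers we need, largest first, and treat an
 * allocation failure as "not enough memory right now". */
int allocate_buffers(void **big, size_t big_bytes,
                     void **small, size_t small_bytes)
{
    if (cudaMalloc(big, big_bytes) != cudaSuccess) {
        fprintf(stderr, "large allocation of %zu bytes failed\n", big_bytes);
        return -1;
    }
    if (cudaMalloc(small, small_bytes) != cudaSuccess) {
        fprintf(stderr, "small allocation of %zu bytes failed\n", small_bytes);
        cudaFree(*big);            /* roll back so we don't leak */
        return -1;
    }
    return 0;
}
```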

Well, I understand it’s probably not made officially available by the API, but in order to know when to allow a memory allocation and when to return an error, the driver must be keeping a list of where memory has already been allocated. I doubt it would be a huge stretch for the CUDA developers to expose this in an upcoming API release.

Also, with just attempting to allocate memory and checking for errors, you’d need to probe every possible location; even if you used a sort of ‘sieve’ to check larger regions, you’d still run the risk of a small allocated chunk sitting in the middle, which would then cause the allocation to fail when you try to grab the larger piece.

As for the contexts, I wonder if the driver already ‘locks’ a memory region to a context when that context requests the allocation; this would keep other contexts from stealing the memory. The CUDA API could also support ‘freeable’ memory: a context allocates memory but marks it ‘freeable’, so that if another context requests memory (and there is no actual free memory left), the driver can release the ‘freeable’ block, the new context gets its memory allocated, and the original context is (somehow) notified that a block it ‘owned’ has been freed and given to another context.

There is cuMemGetInfo in the driver API, but this isn’t really all that useful if you’re using the runtime API. You could, I suppose, start up a new context using the driver API and get the free memory, but then you need to account for each context using ~30 MB of device memory (in my experience anyway). Also bear in mind on Vista (maybe an issue for you; it is for me) that this call always returns the full memory of the card (minus 30 MB and however much you allocated in the context) regardless of how many other contexts may have already allocated memory, due to GPU virtualisation by Vista. So it’s real fun when you allocate more memory than the card has and it pages in and out (well, I assume it’s paging, because the host CPU goes manic).

In my experience getting free-memory counts by manually keeping score is a bit of a pain, and if you’re using multiple processes that allocate CUDA device memory it gets even nastier in Vista (I’m not sure I should really have to do interprocess comms to keep a consistent count of device memory).

I’d really like to see some more API support in this area; I assume the 2.1 beta hasn’t changed anything, but I’ve not actually checked.

cuMemGetInfo is declared safe to call from runtime API programs (assuming they have already created a context by making a cuda* call).
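For anyone searching later, a minimal sketch of that trick, assuming the runtime has already created the context (note that the parameters were unsigned int in older toolkits and size_t in newer ones):

```c
#include <cuda.h>           /* driver API: cuMemGetInfo */
#include <cuda_runtime.h>   /* runtime API */
#include <stdio.h>

int main(void)
{
    /* Any runtime call (even a no-op cudaFree) creates the context
     * that the driver API call below will piggyback on. */
    cudaFree(0);

    size_t free_bytes = 0, total_bytes = 0;
    if (cuMemGetInfo(&free_bytes, &total_bytes) == CUDA_SUCCESS)
        printf("free: %zu MB of %zu MB\n",
               free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```

Link against both the driver and runtime libraries (e.g. -lcuda -lcudart).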

I only use the driver API, but cuMemGetInfo() still doesn’t address the memory fragmentation problem…the only thing that would work is for nVidia to provide an API function that lists all of the currently allocated memory blocks (and perhaps also provides a pointer to the context that allocated each block).

I think this would be a good feature for an upcoming API release, since it gives us more control over when/if we need to allocate new memory. As hill_matthew wrote, this is a real problem in Vista, since apparently Vista’s GPU virtualization doesn’t play well with the memory info functions in the current drivers.

Just to play devil’s advocate: why should NVIDIA provide such an API? Windows / Linux / Mac OS X don’t provide such information to programs, which must rely solely on malloc and free to handle memory. Why should working on the GPU be any different?

You must keep in mind what memory fragmentation is and when it is a problem.

Memory is virtualized. This means you don’t need contiguous free physical memory. Tiny scattered pieces of physical memory can be combined to give you a contiguous piece of address space.

Virtual address space is not shared between contexts. Each context can have an address 0x1000, and it will point to different physical memory. Nevertheless, a 32-bit virtual address space may become fragmented. I think the chance of this is low when you’re using a 1 GB GPU, but it’s a serious problem with a 4 GB GPU.

I believe nvidia GPUs already support 64-bit virtual addresses, and these get turned on in x64 OSes. Can someone confirm this?

Agreeing with this. The better solution (in this case) sounds like load balancing at context creation time among multiple GPUs.

I think this should be offered because people are used to glancing at their task manager, top, or whatever, to see the current CPU load and memory allocation. As far as I can see there isn’t really a way to do this for the GPU in CUDA without compiling the info yourself; it’s prime API territory, and I just think it’s a bit of a missed opportunity, especially when Vista messes around with paging and confuses matters. I hardly want to go and start doing platform-specific stuff on a multi-platform API.

Sorry for hijacking somebody else’s post with my opinion on the matter, I realise this isn’t really going to help the poster solve his problem which is different to mine.

An aside: just because the competition doesn’t offer something shouldn’t be a reason for not offering it yourself, if it adds value for your customers by allowing your partners to create better products. The better and more complete Nvidia’s APIs are, the more likely we are to develop apps that people will rush out and buy CUDA-compatible cards to run.

That’s an argument for adding a task manager-esque API, not this particular call. I certainly understand arguments that we need to add something to support (for example) GPU top, but this particular call isn’t directly applicable to that.

Here’s looking forward to some future GPU toppage APIs :)

Sorry all, I didn’t mean to stir up a big fuss…I’m working on another project where I’ve got to write my own memory manager, and I guess I forgot that CUDA already “knows” where to store data when memory is allocated (i.e. cuMemAlloc() just puts it wherever it wants and returns the pointer…in my other project, I have to choose the memory locations myself and hope I don’t overwrite anything). However, I was also interested in a “Task Manager”-style watchdog for the GPU, which could perhaps also be used to kill/suspend any rogue programs running on the GPU. It would also be neat to see the context stack, and the host processes associated with each context.

Also, alex_dubinsky, it’s interesting to know that GPU memory is virtualized for each context. I thought that all contexts would share the same memory, which is why I brought this topic up in the first place. Whenever a new context is pushed onto the stack, what does the GPU do with the memory already stored on the device? Does it page it out to host memory and pull it back when the original context runs again? Is that the difference between XP and Vista that hill_matthew mentioned?

Yes, it might page out on Vista (with wonderful effects on performance), but only if there is not enough GPU memory for both contexts. Even discounting paging, virtualization solves the problem of fragmentation, and that works on XP too.

Btw, what would be interesting is an API that translates virtual GPU addresses to physical ones, to give to PCI devices for DMA into the GPU.

Ahh, yes. When you write your own memory manager without access to the MMU and page tables, you don’t get the benefit of virtualization, and fragmentation becomes a really big problem. If you want to be really fancy, you could implement your own software page tables in shared memory. It might not even affect performance, since the lookup would overlap with the cost of the DRAM access.
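Purely as an illustration of that idea (the page size, table size, and kernel name are all made up), a software page-table lookup staged through shared memory might look roughly like this:

```c
#define PAGE_SIZE 256   /* elements per logical page (assumption) */
#define NUM_PAGES 64    /* pages covered by the table (assumption) */

/* Gathers n elements (n <= NUM_PAGES * PAGE_SIZE) from a pool of
 * fixed-size pages, translating logical pages to physical pages
 * through a per-block table held in shared memory. */
__global__ void gather(const float *pool, const int *page_table_gmem,
                       float *out, int n)
{
    __shared__ int page_table[NUM_PAGES];

    /* Stage the page table into shared memory once per block. */
    for (int i = threadIdx.x; i < NUM_PAGES; i += blockDim.x)
        page_table[i] = page_table_gmem[i];
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int page   = idx / PAGE_SIZE;       /* logical page number   */
        int offset = idx % PAGE_SIZE;       /* offset within page    */
        int phys   = page_table[page];      /* translate via table   */
        out[idx] = pool[phys * PAGE_SIZE + offset];
    }
}
```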

I am probably misunderstanding all this, so forgive me if I am stupid ;) But when doing your own memory manager, what I would guess is:

  • allocate all GPU memory available as soon as you start (or the total amount your program needs)

  • do your own ‘allocation’ for arrays and such, keeping track yourself of where you put everything.

Why would you need any more info from the GPU/CPU after that first allocation?
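For the record, a bare-bones sketch of that “allocate everything up front and carve it up yourself” approach, written as a simple bump allocator with no free list (all names are hypothetical):

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* One big device allocation made at startup, handed out in aligned slices. */
typedef struct {
    char  *base;    /* start of the pool on the device  */
    size_t size;    /* total pool size in bytes         */
    size_t used;    /* bytes handed out so far          */
} gpu_pool;

int pool_init(gpu_pool *p, size_t bytes)
{
    p->used = 0;
    p->size = bytes;
    return cudaMalloc((void **)&p->base, bytes) == cudaSuccess ? 0 : -1;
}

/* Carve the next slice off the pool; returns NULL when it is exhausted. */
void *pool_alloc(gpu_pool *p, size_t bytes)
{
    size_t aligned = (bytes + 255) & ~(size_t)255;   /* keep 256-byte alignment */
    if (p->used + aligned > p->size)
        return NULL;
    void *ptr = p->base + p->used;
    p->used += aligned;
    return ptr;
}

void pool_destroy(gpu_pool *p)
{
    cudaFree(p->base);
    p->base = NULL;
    p->used = p->size = 0;
}
```

Since everything lives inside a single cudaMalloc(), any fragmentation is confined to a pool you fully control, which is exactly the bookkeeping the earlier posts were complaining about having to do by hand.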