Maximum Number of cudaMalloc() Calls

Does anyone know if there is a maximum number of global memory objects that can be allocated via a series of cudaMalloc() calls? From everything I can determine experimentally, my Quadro 3700 (512MB) gives “out of memory” errors after 72,407 calls, regardless of the allocation size. cuMemGetInfo() shows that the global memory has in fact been consumed after the cudaMalloc() failure. My C1060 (4GB) board behaves the same way and is able to allocate 129,814 objects, but after the first cudaMalloc() failure, cuMemGetInfo() shows I still have around 12MB of global memory free.

I thought that perhaps cudaMalloc() has a very high alignment value, but to consume all memory in the observed number of allocations, the Quadro board would have to allocate on at least a 4KB boundary and the C1060 on roughly a 32KB boundary. Those alignments seem very large, if that is what is happening.

Alternatively, perhaps there is a CUDA global memory manager that only has a finite number of slots for allocated memory?

I’ve searched and read many threads here trying to find an answer to this, to no avail.

By the way, I realize this type of uncoalesced memory structure will result in very poor kernel performance, but I’m trying to port a very complex existing program to CUDA and need to get it working with the existing data structures first, then work on performance.

Thank you,
Ken Chaffin

From what I can see, it looks like (at least for CUDA 2.3 on a compute capability 1.1 device) the allocation page size is either 4KB or 64KB, beyond an initial 16MB of pre-allocated memory per context. I posted a little test program you can use to see how it works here.

Thanks for the reply and info. This seems consistent with what I’m seeing, although the different page sizes for different cudaMalloc() calls were confusing me.

Looks like I’ll need to write my own memory manager which divvies up large chunks into smaller bites.