Does anyone know if there is a maximum number of global memory objects that can be allocated via a series of cudaMalloc() calls? From everything I can determine via experimentation, my Quadro FX 3700 (512MB) gives "out of memory" errors after 72,407 calls, regardless of the allocation size. cudaMemGetInfo() shows that the global memory has in fact been consumed after the cudaMalloc() failure. My C1060 (4GB) board behaves the same way but is able to allocate 129,814 objects; however, after the first cudaMalloc failure, cudaMemGetInfo() still reports around 12MB of global memory free.
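For anyone who wants to reproduce this, here is roughly the kind of probe I've been running. It's a sketch, not my exact test harness: the per-allocation size (16 bytes) and the loop cap are arbitrary choices, and it needs a CUDA-capable GPU to run.

```cpp
// Sketch: count how many small cudaMalloc() calls succeed before the
// driver reports out-of-memory, then see how much memory is still free.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    std::vector<void*> ptrs;
    const size_t allocSize = 16;       // arbitrary small fixed size
    const int maxAllocs = 1000000;     // safety cap on the loop

    for (int i = 0; i < maxAllocs; ++i) {
        void* p = nullptr;
        if (cudaMalloc(&p, allocSize) != cudaSuccess) {
            size_t freeB = 0, totalB = 0;
            cudaMemGetInfo(&freeB, &totalB);
            printf("cudaMalloc failed after %d allocations; "
                   "%zu bytes still reported free of %zu total\n",
                   i, freeB, totalB);
            break;
        }
        ptrs.push_back(p);
    }
    for (void* p : ptrs) cudaFree(p);  // clean up what succeeded
    return 0;
}
```

On both of my boards the failure count is the same no matter what allocSize I use, which is what makes me suspect something other than raw capacity is the limit.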
I thought that perhaps cudaMalloc has a very high alignment value, but to consume the observed maximum number of allocations, the Quadro board would have to allocate on at least a 4K boundary and the C1060 on roughly a 32K boundary. Those alignments seem very large if that is what is happening.
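One way to check the alignment theory directly is to watch how much free memory a single tiny allocation actually consumes. This is just a sketch of the idea (again requiring a CUDA device), and note that cudaMemGetInfo() may itself report at a coarse granularity, so the result is an upper bound on the true per-allocation cost rather than an exact figure:

```cpp
// Sketch: estimate per-allocation granularity by diffing the free-memory
// counter around a 1-byte cudaMalloc().
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeBefore = 0, freeAfter = 0, total = 0;
    void* p = nullptr;

    cudaMemGetInfo(&freeBefore, &total);
    cudaMalloc(&p, 1);                 // deliberately tiny request
    cudaMemGetInfo(&freeAfter, &total);

    printf("A 1-byte cudaMalloc consumed %zu bytes of free memory\n",
           freeBefore - freeAfter);

    cudaFree(p);
    return 0;
}
```

Comparing the device pointers returned by two consecutive tiny allocations would give a second, independent estimate of the allocation stride.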
Alternatively, perhaps there is a CUDA global memory manager that only has a finite number of slots for allocated memory?
I’ve searched and read many threads here trying to find an answer to this, to no avail.
By the way, I realize this type of uncoalesced memory structure will result in very poor kernel performance, but I'm trying to port a very complex existing program to CUDA and sort of need to get it working with the existing data structures first, and then work on performance.