Allocating arrays in chunks

I’m looking into dynamic memory management in my code, and was wondering if there is any tried and tested method of allocating, say, a 1 GB array as, for argument’s sake, 256 * 4 MB chunks - where each 4 MB chunk is itself contiguous, but the 256 chunks may land anywhere in card memory.

The reason for this is that I’m seeing a 5x increase in runtime without any memory manager (i.e. just calling cudaMalloc/cudaFree throughout the code), and the contiguous-array allocators I’ve tried so far are failing for various reasons - slab allocators seem to fragment too much, and caching allocators seem hard to tune (at least for my particular problem) and leave large amounts of memory sitting unused in the cache.

Ideally I’d like to ‘mask’ the chunked nature of the array and still pass a float* to kernels (i.e. without using a float** and paying for the extra registers and memory reads), but honestly I’m open to all solutions!
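
Something along these lines is the kind of thing I have in mind - just a rough, untested sketch, with made-up names (NUM_CHUNKS, CHUNK_BYTES, h_chunks, d_chunks):

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    const int    NUM_CHUNKS  = 256;                  // hypothetical chunk count
    const size_t CHUNK_BYTES = 4u * 1024u * 1024u;   // 4 MB per chunk

    std::vector<float*> h_chunks(NUM_CHUNKS, nullptr);

    // each chunk is contiguous on its own, but the chunks can land
    // anywhere in device memory
    for (int i = 0; i < NUM_CHUNKS; ++i) {
        if (cudaMalloc((void**)&h_chunks[i], CHUNK_BYTES) != cudaSuccess) {
            std::fprintf(stderr, "allocation of chunk %d failed\n", i);
            return 1;
        }
    }

    // mirror the pointer table on the device so kernels can take a
    // float** and look up their chunk
    float** d_chunks = nullptr;
    cudaMalloc((void**)&d_chunks, NUM_CHUNKS * sizeof(float*));
    cudaMemcpy(d_chunks, h_chunks.data(), NUM_CHUNKS * sizeof(float*),
               cudaMemcpyHostToDevice);

    // ... launch kernels with d_chunks ...

    cudaFree(d_chunks);
    for (int i = 0; i < NUM_CHUNKS; ++i) cudaFree(h_chunks[i]);
    return 0;
}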

sBc-Random, what does the data structure you want to implement look like?
What comes to mind, without knowing more, is a 2D array to store the data as you describe, something like 256 rows of 4 MB each.
But since I believe you have already thought of that, another “solution” would be nested allocation (a loop of allocations for the second dimension), which we normally move away from in favor of flat 1D arrays.

The reason for this is that, as you know, you can deallocate individual rows, since I understand you want it to be dynamic. Whether that is easy, or even possible, to do with cudaMalloc, I don’t know. And I also wonder whether it conflicts with your wish to avoid float**, since passing it to kernels may be problematic.

Thanks for your response - that’s what I was thinking of (a for loop of 256 cudaMallocs), but I’m worried about the kernel performance side as a result :)

The problem I’m trying to address for my particular use case is that I’m finding it hard to allocate one contiguous chunk of memory on the card (due to memory fragmentation and/or actually running out of memory, depending on the efficiency of each memory management method I’ve been testing) - so I’m entertaining the possibility of something similar to sectors on a hard drive, but I’m only in the early stages of thinking it through.
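
To give an idea of what I mean by ‘sectors’, a minimal sketch (the SectorPool name and its acquire/release methods are entirely made up): pre-allocate a set of fixed-size chunks once, then serve and reclaim them from a free list, so the hot path never calls cudaMalloc/cudaFree.

#include <cuda_runtime.h>
#include <vector>

class SectorPool {
public:
    // allocate every sector once, up front
    bool init(int numSectors, size_t sectorBytes) {
        for (int i = 0; i < numSectors; ++i) {
            void* p = nullptr;
            if (cudaMalloc(&p, sectorBytes) != cudaSuccess) return false;
            free_.push_back(static_cast<float*>(p));
        }
        return true;
    }
    // hand out one pre-allocated sector; no cudaMalloc on the hot path
    float* acquire() {
        if (free_.empty()) return nullptr;
        float* p = free_.back();
        free_.pop_back();
        return p;
    }
    // return a sector to the pool; no cudaFree on the hot path
    void release(float* p) { free_.push_back(p); }
    // all sectors must have been released before calling this
    void destroy() {
        for (float* p : free_) cudaFree(p);
        free_.clear();
    }
private:
    std::vector<float*> free_;
};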

That’s pretty much the only advantage of using this nested allocation over the 1D implementation: it doesn’t require contiguous spaces. As for the performance penalty of allocation/deallocation, I don’t think there is a way around it if it is dynamic…

And just for the sake of asking, since I imagine you have already thought about it: doesn’t thrust::device_vector help?!
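
Just for reference, something like this is what I mean - a minimal sketch (note that a device_vector is still a single contiguous allocation underneath, so it may not help with the fragmentation itself):

#include <thrust/device_vector.h>
#include <thrust/fill.h>

void example()
{
    // one contiguous device allocation, freed automatically when the
    // vector goes out of scope
    thrust::device_vector<float> d_vec(1 << 20);
    thrust::fill(d_vec.begin(), d_vec.end(), 1.0f);

    // the raw pointer can still be handed to an ordinary kernel as float*
    float* raw = thrust::raw_pointer_cast(d_vec.data());
    (void)raw;
}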

Sorry I didn’t explain myself properly

I was worried about the performance inside a kernel, in that it would need to keep finding the indices:

__global__ void kernel(float **in, float **out)
{
    // each block first looks up the pointer to its own chunk
    float *input_data  = in[blockIdx.x];
    float *output_data = out[blockIdx.x];

    ...

}

etc

But I think it’s unavoidable (and also extremely messy for things like matrix multiplication)
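
The closest I can picture to ‘masking’ it - purely a sketch, with made-up names (CHUNK_ELEMS, chunked_at, scale) - is a small __device__ helper that maps a flat element index to chunk + offset, so the kernel body reads like a flat array access even though a float** is still there underneath:

#include <cuda_runtime.h>
#include <cstddef>

// elements per 4 MB chunk (assuming the 4 MB chunk size from above)
#define CHUNK_ELEMS (4u * 1024u * 1024u / sizeof(float))

__device__ __forceinline__ float& chunked_at(float** chunks, size_t i)
{
    // map a flat element index to (chunk, offset within chunk);
    // CHUNK_ELEMS is a power of two, so this compiles to shift/mask
    return chunks[i / CHUNK_ELEMS][i % CHUNK_ELEMS];
}

__global__ void scale(float** data, size_t n, float s)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        chunked_at(data, i) *= s;   // reads like a flat array access
}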

If I understand now (maybe not), and your data is already 2D, then that means you would need a 3D data structure?
X, Y for your data, Z for the 256 levels of it.
And things can now be really, really messy. Maybe it is worse than I thought? :)
Hey, I’m not discouraging anything!