Allocating arrays in chunks

I’m looking into dynamic memory management in my code, and was wondering if there is any tried and tested method of allocating, say, a 1 GB array as, for argument’s sake, 256 * 4 MB chunks - where each 4 MB chunk is itself contiguous, but the 256 chunks may land anywhere in card memory.

The reason for this is that I’m seeing a 5x increase in runtime without any memory manager (i.e. just calling cudaMalloc/cudaFree throughout the code), and the contiguous-array allocators I’ve tried so far are failing for various reasons - slab allocators seem to fragment too much, and caching allocators seem hard to tune (at least for my particular problem) and leave large amounts of memory sitting unused in the cache.

Ideally I’d like to ‘mask’ the chunked nature of the array and still pass a float* to kernels (i.e. without using a float** and paying for the extra registers and memory reads), but honestly I’m open to all solutions!
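
Something along these lines is the kind of thing I have in mind - just a rough, untested sketch, with made-up names (NUM_CHUNKS, CHUNK_BYTES, h_chunks, d_chunks):

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    const int    NUM_CHUNKS  = 256;                  // hypothetical chunk count
    const size_t CHUNK_BYTES = 4u * 1024u * 1024u;   // 4 MB per chunk

    std::vector<float*> h_chunks(NUM_CHUNKS, nullptr);

    // each chunk is contiguous on its own, but the chunks can land
    // anywhere in device memory
    for (int i = 0; i < NUM_CHUNKS; ++i) {
        if (cudaMalloc((void**)&h_chunks[i], CHUNK_BYTES) != cudaSuccess) {
            std::fprintf(stderr, "allocation of chunk %d failed\n", i);
            return 1;
        }
    }

    // mirror the pointer table on the device so kernels can take a
    // float** and look up their chunk
    float** d_chunks = nullptr;
    cudaMalloc((void**)&d_chunks, NUM_CHUNKS * sizeof(float*));
    cudaMemcpy(d_chunks, h_chunks.data(), NUM_CHUNKS * sizeof(float*),
               cudaMemcpyHostToDevice);

    // ... launch kernels with d_chunks ...

    cudaFree(d_chunks);
    for (int i = 0; i < NUM_CHUNKS; ++i) cudaFree(h_chunks[i]);
    return 0;
}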

sBc-Random, what does the data structure you want to implement look like?
What comes to mind, without knowing more, is a 2D array to store the data as you describe, something like 256 rows of 4 MB each.
But since I believe you have already thought of that, another “solution” would be nested allocation (a loop of allocations for the second dimension), which we normally move away from in favor of flat 1D arrays.

The reason for this is that, as you know, you can deallocate individual rows, since I understand you want it to be dynamic. Whether that is easy, or even possible, to do with cudaMalloc, I don’t know. And I also wonder whether it conflicts with your wish to avoid float**, since passing it to kernels may be problematic.

Thanks for your response - that’s what I was thinking of (a for loop of 256 cudaMallocs), but I’m worried about the kernel performance side as a result :)

The problem I’m trying to address for my particular use case is that I’m finding it hard to allocate one contiguous chunk of memory on the card (due to memory fragmentation and/or actually running out of memory, depending on the efficiency of each memory management method I’ve been testing) - so I’m entertaining the possibility of something similar to sectors on a hard drive, but I’m only in the early stages of thinking it through.
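
To give an idea of what I mean by ‘sectors’, a minimal sketch (the SectorPool name and its acquire/release methods are entirely made up): pre-allocate a set of fixed-size chunks once, then serve and reclaim them from a free list, so the hot path never calls cudaMalloc/cudaFree.

#include <cuda_runtime.h>
#include <vector>

class SectorPool {
public:
    // allocate every sector once, up front
    bool init(int numSectors, size_t sectorBytes) {
        for (int i = 0; i < numSectors; ++i) {
            void* p = nullptr;
            if (cudaMalloc(&p, sectorBytes) != cudaSuccess) return false;
            free_.push_back(static_cast<float*>(p));
        }
        return true;
    }
    // hand out one pre-allocated sector; no cudaMalloc on the hot path
    float* acquire() {
        if (free_.empty()) return nullptr;
        float* p = free_.back();
        free_.pop_back();
        return p;
    }
    // return a sector to the pool; no cudaFree on the hot path
    void release(float* p) { free_.push_back(p); }
    // all sectors must have been released before calling this
    void destroy() {
        for (float* p : free_) cudaFree(p);
        free_.clear();
    }
private:
    std::vector<float*> free_;
};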

That’s pretty much the only advantage of using this nested allocation over the 1D implementation: it doesn’t require contiguous spaces. As for the performance penalty of allocation/deallocation, I don’t think there is a way around it if it is dynamic…

And just for the sake of asking, since I imagine you have already thought about it: doesn’t thrust::device_vector help?!
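
Just for reference, something like this is what I mean - a minimal sketch (note that a device_vector is still a single contiguous allocation underneath, so it may not help with the fragmentation itself):

#include <thrust/device_vector.h>
#include <thrust/fill.h>

void example()
{
    // one contiguous device allocation, freed automatically when the
    // vector goes out of scope
    thrust::device_vector<float> d_vec(1 << 20);
    thrust::fill(d_vec.begin(), d_vec.end(), 1.0f);

    // the raw pointer can still be handed to an ordinary kernel as float*
    float* raw = thrust::raw_pointer_cast(d_vec.data());
    (void)raw;
}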

Sorry I didn’t explain myself properly

I was worried about the performance inside a kernel, in that it would need to keep finding the indices:

__global__ void kernel(float **in, float **out)
{
    // each block first looks up the pointer to its own chunk
    float *input_data  = in[blockIdx.x];
    float *output_data = out[blockIdx.x];

    ...

}

etc

But I think it’s unavoidable (and also extremely messy for things like matrix multiplication)
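
The closest I can picture to ‘masking’ it - purely a sketch, with made-up names (CHUNK_ELEMS, chunked_at, scale) - is a small __device__ helper that maps a flat element index to chunk + offset, so the kernel body reads like a flat array access even though a float** is still there underneath:

#include <cuda_runtime.h>
#include <cstddef>

// elements per 4 MB chunk (assuming the 4 MB chunk size from above)
#define CHUNK_ELEMS (4u * 1024u * 1024u / sizeof(float))

__device__ __forceinline__ float& chunked_at(float** chunks, size_t i)
{
    // map a flat element index to (chunk, offset within chunk);
    // CHUNK_ELEMS is a power of two, so this compiles to shift/mask
    return chunks[i / CHUNK_ELEMS][i % CHUNK_ELEMS];
}

__global__ void scale(float** data, size_t n, float s)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        chunked_at(data, i) *= s;   // reads like a flat array access
}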

If I understand now (maybe not), and your data is already 2D, then that means you would need a 3D data structure?
X, Y for your data, Z for the 256 levels of it.
And things can now be really, really messy. Maybe it is worse than I thought? :)
Hey, I’m not discouraging anything!