I am porting HPC code to GPU, mainly with OpenACC, and I am facing some design issues. On the CPU, there are two nested loops with many calculations inside, and for each iteration of the inner loop there may or may not be an allocation of new memory, done with std::vector::push_back.
To port this to GPU, I set the outer loop with acc parallel loop gang and the inner loop with acc loop vector. I guess I am over-simplifying here, but it doesn’t matter.
I don’t really know how to handle the memory allocation, as the amount needed can’t be determined in advance. There isn’t enough memory to allocate space up front for the worst case, i.e. as much space as the total number of iterations. I thought I could run the loop twice: a first pass that computes how much memory is needed, and a second pass that uses the allocated space, but the cost of running the loop twice is very high.
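To make the two-pass idea concrete, here is a rough sketch of what I have in mind (keep() and value() are placeholders for the real predicate and computation, not my actual code):

```cpp
#include <vector>
#include <numeric>

#pragma acc routine seq
inline bool keep(int i, int j) { return ((i + j) % 3) == 0; }   // placeholder predicate

#pragma acc routine seq
inline double value(int i, int j) { return i * 1.0 + j; }       // placeholder computation

void two_pass(int n_outer, int n_inner, std::vector<double>& out) {
    std::vector<int> counts(n_outer), offsets(n_outer + 1, 0);
    int* counts_ptr = counts.data();

    // Pass 1: count how many items each outer iteration will produce.
    #pragma acc parallel loop gang copyout(counts_ptr[0:n_outer])
    for (int i = 0; i < n_outer; ++i) {
        int c = 0;
        #pragma acc loop vector reduction(+:c)
        for (int j = 0; j < n_inner; ++j)
            if (keep(i, j)) ++c;
        counts_ptr[i] = c;
    }

    // Exclusive prefix sum on the host gives each outer iteration its write offset.
    std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);
    const int total = offsets[n_outer];
    out.resize(total);
    double* out_ptr = out.data();
    int*    off_ptr = offsets.data();

    // Pass 2: fill the pre-allocated space (inner loop kept sequential so the
    // writes within one outer iteration stay ordered).
    #pragma acc parallel loop gang copyin(off_ptr[0:n_outer+1]) copyout(out_ptr[0:total])
    for (int i = 0; i < n_outer; ++i) {
        int pos = off_ptr[i];
        #pragma acc loop seq
        for (int j = 0; j < n_inner; ++j)
            if (keep(i, j)) out_ptr[pos++] = value(i, j);
    }
}
```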
As the inner loop is small, I also thought about allocating memory not in the inner loop, but inside the outer loop (at the gang level). Each gang could allocate some memory that could later be combined with that of all the other gangs. But I can’t figure out how.
Is there some known trick to handle this situation? Is there, for example, a gang-blocking memory allocation technique? Or something with acc atomic?
One more possibility I have been thinking about: I could define the array as a thrust::device_vector somewhere before the loop. But then how can I synchronize the gangs so they wait while the vector is being resized?
Unfortunately, you can’t use push_back or any other operation that would resize the vector on the device. It isn’t thread-safe, since you could have multiple threads each trying to resize the array at the same time. Plus, this wouldn’t resize the host copy of the vector, causing issues when trying to sync the two, and device-side allocation is very slow.
There isn’t enough memory to allocate space up front for the worst case, i.e. as much space as the total number of iterations.
A couple of ideas:
Use a higher level parallel model, such as MPI, to use multiple GPUs
Use a blocking algorithm where the iterations are segmented into chunks sized to fit in GPU memory.
Use CUDA Unified Memory (UM), either “-gpu=mem:managed” if only using allocated “heap” memory, or “-gpu=mem:unified” for both static and allocated memory if on a system with HMM support (like a Grace Hopper).
UM will be the easiest since it doesn’t require any code changes, and it allows for oversubscription of device memory. There may be a performance penalty from data movement (less so on a Grace Hopper).
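For example, with the NVHPC compilers (assuming nvc++ here; adjust for your actual build), enabling managed memory is just a compile flag, no source changes:

```
# heap allocations become CUDA managed memory; no code changes needed
nvc++ -acc -gpu=mem:managed -O2 mycode.cpp -o mycode
```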
Blocking can help a bit with performance, but you still need to copy data back and forth.
Using MPI with multiple GPUs would likely be the fastest method, but it does require extra programming (such as the domain decomposition) and adds communication overhead.
In all cases, memory should be allocated from the host.
I could define the array as a thrust::device_vector somewhere before the loop. But then how can I synchronize the gangs so they wait while the vector is being resized?
Granted, I haven’t used these much myself, but I don’t think it will help here. You still need to allocate the device vector from the host. The only advantage would be that these are device-only, but if you want the data on the host, then you still need to manage a host copy.
Plus, these are primarily used in Thrust calls, and if you want to move to Thrust, you should instead look at C++ standard language parallelism for portability.
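If you do end up keeping a device_vector, the usual pattern (just a sketch, untested here) is to size it from the host and hand its raw pointer to OpenACC via deviceptr, never resizing it in device code:

```cpp
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

int main() {
    const int n = 1000;

    // Allocate and size the vector from the host; no resizing in device code.
    thrust::device_vector<double> d_vec(n);
    double* d_ptr = thrust::raw_pointer_cast(d_vec.data());

    // Tell OpenACC the pointer is already a device address.
    #pragma acc parallel loop deviceptr(d_ptr)
    for (int i = 0; i < n; ++i) {
        d_ptr[i] = 2.0 * i;
    }

    // Copy back to the host only when needed.
    thrust::host_vector<double> h_vec = d_vec;
    return (h_vec[10] == 20.0) ? 0 : 1;
}
```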
I am already within an MPI parallel model with domain decomposition. The piece of code I want to port runs within one GPU only (each GPU does the same thing).
If I segment the data into chunks, I expect a big drop in performance, because I am already not giving the GPU very much work to do (this is a heavily memory-bound problem).
I am not sure how UM would help here; could you explain the idea?
And what about malloc-ing small memory chunks at each gang-level iteration? Is this possible at all? The memory could be consolidated into a Thrust vector after the loop.
UM allows oversubscription. It doesn’t solve the device-side allocation issue; it just solves the issue of having more data than can fit in the available device memory.
And what about malloc-ing small memory chunks at each gang-level iteration? Is this possible at all?
Yes, device-side allocation via malloc is available but not recommended. malloc calls get serialized, so they can cause performance issues. Also, the device heap is quite small by default, so be sure to increase the heap size either by calling “cudaDeviceSetLimit” with “cudaLimitMallocHeapSize” or by setting the environment variable “NV_ACC_CUDA_HEAPSIZE”.
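From the host, the call looks roughly like this (512 MB is just an arbitrary example size; setting NV_ACC_CUDA_HEAPSIZE in the environment achieves the same without code changes):

```cpp
#include <cuda_runtime.h>

int main() {
    // Raise the device malloc heap before the first kernel launch;
    // 512 MB here is an arbitrary example size.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 512ull * 1024 * 1024);
    // ... OpenACC regions that call malloc on the device go here ...
    return 0;
}
```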
The memory could be consolidated into a Thrust vector after the loop.
Maybe? I’ve never tried this myself so I don’t really know, but I would think it would be problematic.
The pointer to the memory will be private to each gang, so you’ll then need to gather all of these pointers into a global array of pointers so they aren’t lost, and then somehow insert them into the top-level vector.
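Something along these lines is what I have in mind, though it’s only a sketch with made-up sizes and I haven’t tested it; the key point is that the pointer array stays resident on the device, since the device addresses are meaningless on the host:

```cpp
#include <cstdlib>
#include <cstdio>

int main() {
    const int n_gangs  = 256;
    const int per_gang = 64;                         // made-up fixed size per gang
    const int total    = n_gangs * per_gang;

    // Array of per-gang pointers; it lives on the device so the device
    // addresses are never copied back to the host.
    double** gang_ptrs = new double*[n_gangs];
    double*  out       = new double[total];

    #pragma acc enter data create(gang_ptrs[0:n_gangs])

    // Kernel 1: each gang allocates its own chunk and records the pointer
    // (assumes the device heap limit has already been raised as above).
    #pragma acc parallel loop gang present(gang_ptrs[0:n_gangs])
    for (int g = 0; g < n_gangs; ++g) {
        double* buf = (double*)malloc(per_gang * sizeof(double));
        for (int k = 0; k < per_gang; ++k) buf[k] = g * 1000.0 + k;
        gang_ptrs[g] = buf;                          // saved so it isn't lost
    }

    // Kernel 2: consolidate the per-gang chunks into one contiguous array,
    // then free the device-side allocations.
    #pragma acc parallel loop gang copyout(out[0:total]) present(gang_ptrs[0:n_gangs])
    for (int g = 0; g < n_gangs; ++g) {
        double* buf = gang_ptrs[g];
        for (int k = 0; k < per_gang; ++k) out[g * per_gang + k] = buf[k];
        free(buf);
    }

    #pragma acc exit data delete(gang_ptrs[0:n_gangs])
    printf("out[70] = %f\n", out[70]);
    delete[] gang_ptrs;
    delete[] out;
    return 0;
}
```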
OK, I will try a few things from your advice and report back eventually.
One detail concerning the malloc function in a gang loop: I cannot find in the OpenACC documentation whether I should use the usual C function malloc or cudaMalloc. Thank you again for your help.
Which language constructs are supported in device code is an implementation detail, not part of OpenACC itself.
In this case, it’s what’s supported by the underlying CUDA device runtime, which is malloc. Example 7.36.3.3 might be good for you to review, as it shows how to capture the arrays allocated in each block (i.e. gang) and use them in later kernels. Also note the limitation that device-malloc’d data can’t be used in host API calls.