I am porting HPC code to GPU, mainly with OpenACC, and I am facing some design issues. On the CPU, there are two nested loops with many calculations inside, and for each iteration of the inner loop there may or may not be an allocation of new memory, done with std::vector::push_back.
To port this to GPU, I set the outer loop with acc parallel loop gang and the inner loop with acc loop vector. I guess I am over-simplifying here, but it doesn’t matter.
I don’t really know how to handle the memory allocation, as the amount needed can’t be determined in advance. There isn’t enough memory to allocate space up front for the worst case, i.e. as much space as the total number of iterations. I thought I could run the loop twice: a first pass that computes how much memory is needed, and a second pass that uses the allocated space, but the cost of running the loop twice is very high.
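To make the two-pass idea concrete, here is a rough sketch of what I have in mind (keep() and value() are placeholders for the real predicate and computation, not my actual code):

```cpp
#include <vector>
#include <numeric>

#pragma acc routine seq
inline bool keep(int i, int j) { return ((i + j) % 3) == 0; }   // placeholder predicate

#pragma acc routine seq
inline double value(int i, int j) { return i * 1.0 + j; }       // placeholder computation

void two_pass(int n_outer, int n_inner, std::vector<double>& out) {
    std::vector<int> counts(n_outer), offsets(n_outer + 1, 0);
    int* counts_ptr = counts.data();

    // Pass 1: count how many items each outer iteration will produce.
    #pragma acc parallel loop gang copyout(counts_ptr[0:n_outer])
    for (int i = 0; i < n_outer; ++i) {
        int c = 0;
        #pragma acc loop vector reduction(+:c)
        for (int j = 0; j < n_inner; ++j)
            if (keep(i, j)) ++c;
        counts_ptr[i] = c;
    }

    // Exclusive prefix sum on the host gives each outer iteration its write offset.
    std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);
    const int total = offsets[n_outer];
    out.resize(total);
    double* out_ptr = out.data();
    int*    off_ptr = offsets.data();

    // Pass 2: fill the pre-allocated space (inner loop kept sequential so the
    // writes within one outer iteration stay ordered).
    #pragma acc parallel loop gang copyin(off_ptr[0:n_outer+1]) copyout(out_ptr[0:total])
    for (int i = 0; i < n_outer; ++i) {
        int pos = off_ptr[i];
        #pragma acc loop seq
        for (int j = 0; j < n_inner; ++j)
            if (keep(i, j)) out_ptr[pos++] = value(i, j);
    }
}
```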
As the inner loop is small, I also thought about allocating memory not in the inner loop, but inside the outer loop (at the gang level). Each gang could allocate some memory that could later be combined with that of all the other gangs. But I can’t figure out how.
Is there some known trick to handle this situation? Is there, for example, a gang-blocking memory allocation technique? Or something with acc atomic?
One more possibility I have been thinking about: I could define the array as a thrust::device_vector somewhere before the loop. But then how can I synchronize the gangs so they wait while the vector is being resized?
Unfortunately, you can’t use push_back or any other operation that would resize the vector on the device. It isn’t thread-safe, since you could have multiple threads each trying to resize the array at the same time. Plus, this wouldn’t resize the host copy of the vector, causing issues when trying to sync the two, and device-side allocation is very slow.
There isn’t enough memory to allocate space up front for the worst case, i.e. as much space as the total number of iterations.
A couple of ideas:
Use a higher level parallel model, such as MPI, to use multiple GPUs
Use a blocking algorithm where the iterations are segmented into chunks sized to fit in GPU memory.
Use CUDA Unified Memory (UM), either “-gpu=mem:managed” if only using allocated “heap” memory, or “-gpu=mem:unified” for both static and allocated memory if on a system with HMM support (like a Grace Hopper).
UM will be the easiest since it doesn’t require any code changes, and it allows for oversubscription of device memory. There may be a performance penalty from data movement (less so on a Grace Hopper).
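For example, with the NVHPC compilers (assuming nvc++ here; adjust for your actual build), enabling managed memory is just a compile flag, no source changes:

```
# heap allocations become CUDA managed memory; no code changes needed
nvc++ -acc -gpu=mem:managed -O2 mycode.cpp -o mycode
```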
Blocking can help a bit with performance, but you still need to copy data back and forth.
Using MPI with multiple GPUs would likely be the fastest method, but it does require extra programming (such as the domain decomposition) and adds communication overhead.
In all cases, memory should be allocated from the host.
I could define the array as a thrust::device_vector somewhere before the loop. But then how can I synchronize the gangs so they wait while the vector is being resized?
Granted, I haven’t used these much myself, but I don’t think it will help here. You still need to allocate the device vector from the host. The only advantage would be that these are device-only, but if you want the data on the host, then you still need to manage a host copy.
Plus, these are primarily used in Thrust calls, and if you want to move to Thrust, you should instead look at C++ standard language parallelism for portability.
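If you do end up keeping a device_vector, the usual pattern (just a sketch, untested here) is to size it from the host and hand its raw pointer to OpenACC via deviceptr, never resizing it in device code:

```cpp
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

int main() {
    const int n = 1000;

    // Allocate and size the vector from the host; no resizing in device code.
    thrust::device_vector<double> d_vec(n);
    double* d_ptr = thrust::raw_pointer_cast(d_vec.data());

    // Tell OpenACC the pointer is already a device address.
    #pragma acc parallel loop deviceptr(d_ptr)
    for (int i = 0; i < n; ++i) {
        d_ptr[i] = 2.0 * i;
    }

    // Copy back to the host only when needed.
    thrust::host_vector<double> h_vec = d_vec;
    return (h_vec[10] == 20.0) ? 0 : 1;
}
```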
I am already within an MPI parallel model with domain decomposition. The piece of code I want to port runs within one GPU only (each GPU does the same thing).
If I segment the data into chunks, I expect a big drop in performance, because I am already not giving the GPU very much work to do (this is a heavily memory-bound problem).
I am not sure how UM would help here; could you explain the idea?
And what about malloc-ing small memory chunks at each gang-level iteration? Is this possible at all? The memory could be consolidated into a Thrust vector after the loop.
UM allows oversubscription. It doesn’t solve the device-side allocation issue; it just solves the issue of having more data than can fit in the available device memory.
And what about malloc-ing small memory chunks at each gang-level iteration? Is this possible at all?
Yes, device-side allocation via malloc is available but not recommended. malloc calls get serialized, so they can cause performance issues. Also, the device heap is quite small by default, so be sure to increase the heap size either by calling “cudaDeviceSetLimit” with “cudaLimitMallocHeapSize” or by setting the environment variable “NV_ACC_CUDA_HEAPSIZE”.
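From the host, the call looks roughly like this (512 MB is just an arbitrary example size; setting NV_ACC_CUDA_HEAPSIZE in the environment achieves the same without code changes):

```cpp
#include <cuda_runtime.h>

int main() {
    // Raise the device malloc heap before the first kernel launch;
    // 512 MB here is an arbitrary example size.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 512ull * 1024 * 1024);
    // ... OpenACC regions that call malloc on the device go here ...
    return 0;
}
```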
The memory could be consolidated into a Thrust vector after the loop.
Maybe? I’ve never tried this myself so I don’t really know, but I would think it would be problematic.
The pointer to the memory will be private to each gang, so you’ll then need to gather all of these pointers into a global array of pointers so they aren’t lost, and then somehow insert them into the top-level vector.
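Something along these lines is what I have in mind, though it’s only a sketch with made-up sizes and I haven’t tested it; the key point is that the pointer array stays resident on the device, since the device addresses are meaningless on the host:

```cpp
#include <cstdlib>
#include <cstdio>

int main() {
    const int n_gangs  = 256;
    const int per_gang = 64;                         // made-up fixed size per gang
    const int total    = n_gangs * per_gang;

    // Array of per-gang pointers; it lives on the device so the device
    // addresses are never copied back to the host.
    double** gang_ptrs = new double*[n_gangs];
    double*  out       = new double[total];

    #pragma acc enter data create(gang_ptrs[0:n_gangs])

    // Kernel 1: each gang allocates its own chunk and records the pointer
    // (assumes the device heap limit has already been raised as above).
    #pragma acc parallel loop gang present(gang_ptrs[0:n_gangs])
    for (int g = 0; g < n_gangs; ++g) {
        double* buf = (double*)malloc(per_gang * sizeof(double));
        for (int k = 0; k < per_gang; ++k) buf[k] = g * 1000.0 + k;
        gang_ptrs[g] = buf;                          // saved so it isn't lost
    }

    // Kernel 2: consolidate the per-gang chunks into one contiguous array,
    // then free the device-side allocations.
    #pragma acc parallel loop gang copyout(out[0:total]) present(gang_ptrs[0:n_gangs])
    for (int g = 0; g < n_gangs; ++g) {
        double* buf = gang_ptrs[g];
        for (int k = 0; k < per_gang; ++k) out[g * per_gang + k] = buf[k];
        free(buf);
    }

    #pragma acc exit data delete(gang_ptrs[0:n_gangs])
    printf("out[70] = %f\n", out[70]);
    delete[] gang_ptrs;
    delete[] out;
    return 0;
}
```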
OK, I will try a few things from your advice and report back eventually.
One detail concerning the malloc function in a gang loop: I cannot find in the OpenACC documentation whether I should use the usual C function malloc or cudaMalloc. Thank you again for your help.
Which language constructs are supported in device code is an implementation detail, not part of OpenACC itself.
In this case, it’s what’s supported by the underlying CUDA device runtime, which is malloc. Example 7.36.3.3 might be good for you to review, as it shows how to capture the arrays allocated in each block (i.e. gang) and use them in later kernels. Also note the limitation that device-malloc’d data can’t be used in host API calls.