Allocate within a parallel loop using OpenACC

Hi all,
I am trying to find a way to dynamically allocate some data within a parallel loop. Consider the following snippet:

real, allocatable :: result(:), globalvalues(:)

allocate (globalvalues(somedimextra))

!$acc parallel loop copy(globalvalues)
do i = 1, N
allocate(result(somedim))
call routine(result, globalvalues)
deallocate(result)
enddo
!$acc end parallel loop

deallocate (globalvalues)

At the moment I am using a simple workaround: declaring “result” statically within the routine. That means I have to fix its dimension at compile time, which is suboptimal. Another option might be to allocate “result” as “allocate(result(somedim, N))” or something like “allocate(result(somedim, numofgangs))”, but those arrays are going to be too big. Are there any other solutions?

Thanks in advance

Hi loriano,

While you can allocate in device code, it’s generally not advisable. The default device heap is quite small, so it’s easy to overflow it. You can increase the device heap size via an API call or by setting the environment variable NV_ACC_CUDA_HEAPSIZE, but device allocations are serialized, which slows performance.

Instead, I typically recommend allocating the temp array on the host and then putting it in a “private” clause. If the outer loop is a “gang” loop, the total size is somedim * numofgangs. If it’s a “vector” loop, it’s somedim * numofgangs * vector_length, which can get large depending on the size of somedim. Hence, ideally you want the outer loop to be gang level and the routine to contain a vector loop.
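To make the size difference concrete, here is a rough back-of-the-envelope sketch. Every number in it (somedim, numofgangs, vector_length, and the 4-byte real) is an illustrative assumption, not a value from this thread; the actual gang count and vector length are chosen by the compiler and runtime unless you set them.

```fortran
program private_footprint
  use iso_fortran_env, only: int64
  implicit none
  ! All sizes below are illustrative assumptions, not values from the thread.
  integer(int64), parameter :: somedim        = 4096_int64
  integer(int64), parameter :: numofgangs     = 1024_int64
  integer(int64), parameter :: vector_length  = 128_int64
  integer(int64), parameter :: bytes_per_real = 4_int64   ! assumes 4-byte default real
  integer(int64), parameter :: mib            = 1024_int64 * 1024_int64
  integer(int64) :: gang_bytes, vector_bytes

  ! One private copy per gang when the outer loop is a gang loop.
  gang_bytes = somedim * numofgangs * bytes_per_real

  ! One private copy per vector lane when the outer loop is a vector loop.
  vector_bytes = gang_bytes * vector_length

  print '(a, i0, a)', 'gang-private copies:   ', gang_bytes / mib, ' MiB'
  print '(a, i0, a)', 'vector-private copies: ', vector_bytes / mib, ' MiB'
end program private_footprint
```

With these assumed values the gang-level footprint is 16 MiB and the vector-level footprint is 2048 MiB, which is why keeping the private array at gang level matters.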

Something like:

real, allocatable :: result(:), globalvalues(:)

allocate (globalvalues(somedimextra))
allocate(result(somedim))

!$acc parallel loop gang copy(globalvalues) private(result)
do i = 1, N
call routine(result, globalvalues)
enddo
!$acc end parallel loop

deallocate(result)
deallocate(globalvalues)
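For reference, a self-contained version of the same pattern might look like the sketch below. The module name work_mod, the body of routine, its extra index argument i, and the sizes are all hypothetical placeholders, since the thread does not show the real routine. Without OpenACC enabled the directives are treated as comments and the program simply runs serially.

```fortran
module work_mod
  implicit none
contains
  ! Hypothetical stand-in for the thread's "routine": fills the scratch
  ! array with a vector loop, then stores one value per outer iteration.
  subroutine routine(scratch, globalvalues, i)
    !$acc routine vector
    real, intent(inout) :: scratch(:)
    real, intent(inout) :: globalvalues(:)
    integer, intent(in) :: i
    integer :: j

    !$acc loop vector
    do j = 1, size(scratch)
      scratch(j) = real(i) * real(j)
    end do

    ! Each outer iteration writes a distinct element, so there is no race.
    globalvalues(i) = scratch(size(scratch))
  end subroutine routine
end module work_mod

program private_scratch
  use work_mod
  implicit none
  integer, parameter :: n = 1000, somedim = 256   ! illustrative sizes
  real, allocatable :: result(:), globalvalues(:)
  integer :: i

  allocate(globalvalues(n))
  allocate(result(somedim))     ! allocated once, on the host
  globalvalues = 0.0

  ! private(result) asks for one copy of result per gang.
  !$acc parallel loop gang copy(globalvalues) private(result)
  do i = 1, n
    call routine(result, globalvalues, i)
  end do

  ! Each element should equal i * somedim.
  print *, 'max error:', &
       maxval(abs(globalvalues - [(real(i) * real(somedim), i = 1, n)]))

  deallocate(result)
  deallocate(globalvalues)
end program private_scratch
```

Note that the loop body never indexes into a per-gang slice: with the private clause, each gang just sees its own result array.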

-Mat

Dear Mat,
Indeed, this was my initial test, but doing so means allocating “somedim * numofgangs”, which is going to be quite a large amount of memory. Within the loop, each gang also needs to access its own part of the array. I will try some possible solutions and let you know.
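One possible way to cap the memory and give each unit of work its own slice is to split the iterations into a fixed number of chunks by hand and index a 2D scratch array by chunk. This is only a sketch of that alternative, not something proposed in this thread; the names scratch and nchunks, the sizes, and the loop body are assumptions made for illustration.

```fortran
program chunked_scratch
  implicit none
  ! Illustrative sizes; nchunks bounds the scratch memory at somedim * nchunks.
  integer, parameter :: n = 1000, somedim = 256, nchunks = 8
  real, allocatable :: scratch(:,:), globalvalues(:)
  integer :: c, i, j, lo, hi

  allocate(scratch(somedim, nchunks))
  allocate(globalvalues(n))
  globalvalues = 0.0

  ! Each chunk c owns column scratch(:, c), so chunks never share scratch space.
  !$acc parallel loop gang num_gangs(nchunks) &
  !$acc&   copy(globalvalues) create(scratch) private(lo, hi, i, j)
  do c = 1, nchunks
    ! Contiguous block of the original 1..n iterations handled by this chunk.
    lo = (c - 1) * n / nchunks + 1
    hi = c * n / nchunks

    !$acc loop seq
    do i = lo, hi
      !$acc loop vector
      do j = 1, somedim
        scratch(j, c) = real(i) * real(j)
      end do
      globalvalues(i) = scratch(somedim, c)
    end do
  end do

  ! Each element should equal i * somedim.
  print *, 'max error:', &
       maxval(abs(globalvalues - [(real(i) * real(somedim), i = 1, n)]))

  deallocate(scratch)
  deallocate(globalvalues)
end program chunked_scratch
```

The trade-off is that a small nchunks saves memory but also limits how much gang-level parallelism is available, so the value would need tuning for the actual problem.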

Thanks again