Inconsistent performance with !$acc exit data copyout finalize and the NV_ACC_MEM_MANAGE environment variable

Hello,

I am working on offloading a large MPI/OpenMP Fortran code to GPUs using OpenACC. Since the offloaded task's data is too large to fit on our GPUs as is, I introduced partial, asynchronous data copies to and from the device using pinned memory. Even so, the task requires careful management of GPU memory to make it fit.

From other posts, such as this or this, I read that by default the runtime does not deallocate device memory upon encountering a copyout or delete clause but rather returns that memory to the device memory pool. For further context, here is pseudo-code representing the structure of the GPU-enabled part in question in our case:


!$acc enter data copyin(global_data1,global_data2)
<some CPU operations>

do iz = 1, iz_end

    !$acc enter data copyin(p5F_float_holder(:,:,:,:,iz)) async(2)

    !$acc parallel loop gang collapse(2) private(p3_end) default(present) present(p5F_float_holder(:,:,:,:,iz),global_data1,global_data2) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            p3_end = global_data1(p2,p1)

            !$acc loop worker vector private(p5_float) collapse(2)
            do p3 = 1, p3_end
                do p4 = 1, p4_end
                    p5_float = p5F_float_holder(p4,p3,p2,p1,iz)
                    p5_end = global_data2(p3,p1)
                    !$acc loop seq
                    do p5 = 1, p5_end

                        <a sequence of numerical operations>
                        <result> = <a sequence of numerical operations>
                        p5_float = p5_float + <result>

                    enddo
                    p5F_float_holder(p4,p3,p2,p1,iz) = p5_float
                enddo
            enddo
        enddo
    enddo

    !$acc enter data copyin(p5B_float_holder(:,:,:,:,iz)) async(2)
    !$acc wait

    !$acc parallel loop gang collapse(2) private(p3_end) default(present) present(p5B_float_holder(:,:,:,:,iz),global_data1,global_data2) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            p3_end = global_data1(p2,p1)

            !$acc loop worker vector private(p5_float) collapse(2)
            do p3 = 1, p3_end
                do p4 = 1, p4_end
                    p5_float = p5B_float_holder(p4,p3,p2,p1,iz)
                    p5_end = global_data2(p3,p1)
                    !$acc loop seq
                    do p5 = p5_end,1,-1

                        <a sequence of numerical operations>
                        <result> = <a sequence of numerical operations>
                        p5_float = p5_float + <result>

                    enddo
                    p5B_float_holder(p4,p3,p2,p1,iz) = p5_float
                enddo
            enddo
        enddo
    enddo

    !$acc exit data copyout(p5F_float_holder(:,:,:,:,iz)) async(3) finalize
    !$acc wait(1)

    !$acc parallel loop collapse(4) default(present) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p6 = 1, p6_end
                do p4 = 1, p4_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    !$acc exit data copyout(p5B_float_holder(:,:,:,:,iz)) async(3) finalize
enddo

The code is executed in parallel on several GPUs (and potentially several nodes), and each MPI process works on a decomposed part of the data that is omitted here. We have 2 GPU nodes, each equipped with 8 NVIDIA RTX A5000 24 GB cards; the nodes are identical in hardware and are connected to the master node via identical InfiniBand links. We are using Rocky Linux 8.7 and HPC SDK 24.5 installed natively via yum. The driver version and CUDA version can be found in the screenshot below:

As shown in the pseudo-code, the idea is to split the transferred data into chunks via the sequential iz loop that runs on the host side and to overlap the transfers to and from the device with the kernel runs. To free the memory, we add the finalize clause to all copyouts. The memory occupancy with just the global data (before we start the iz loop) is shown in the screenshot above. Once we run the iz loop, the memory occupancy increases by around 6 GB per card, bringing it close to the GPU limit. For some problems this is still enough, but for others we get an "out of memory" error. That increased memory usage persists even after we exit the iz loop and move on to the CPU part of our computation.

The workaround we found is to pass the "-x NV_ACC_MEM_MANAGE=0" flag when running the code. With that flag, the memory occupancy stays as shown in the screenshot even during the iz loop execution. The problem is that this approach works well on one node, let's call it node1, but when we run the same executable on the identical node2, it performs much worse and slows down a lot (it can be several times slower), while the power consumption and GPU utilization reported by nvidia-smi drop from their maximum values to about 100 W and 40-50%.

What is strange is that after we manually restart node2, it performs just like node1, but after several runs it starts showing the symptoms again. We found that by removing the NV_ACC_MEM_MANAGE flag we can get normal performance out of that node again (at the cost of the increased GPU memory occupancy), but this inconsistency in performance is really bothering us, so we would like to dig deeper and find what is causing it. Once again, we do not experience any slowdown over time on node1, which is identical to node2.

The compilation flags are: -fast -O3 -mp -cuda -acc -traceback -Minfo=accel -cpp -Mlarge_arrays -Mbackslash -gpu=cc86,deepcopy,cuda12.4,lineinfo,safecache

Our questions are:

  1. What is the “device memory pool” and how is it different from just free GPU memory?
  2. Based on the symptoms, it looks like some resource is being filled up during the fresh run on node2. After it fills up completely, the performance drops. What resource could that be, and how can we free it up or reset it without restarting the node?
  3. What is the expected behavior of the finalize clause in the context of our code and compilation flags? Can there be some conflicts with the copyout being async or the use of pinned memory?
  4. What can we do to investigate this problem further?

We would appreciate any insights or suggestions on how to solve that problem, thanks in advance!

Hi s.dzianisau,

While I’m not sure about the node2 issues, I do see some potential issues with your use of “async”. Let me first attempt to answer your specific questions, then add my comments and suggested code changes.

  1. What is the “device memory pool” and how is it different from just free GPU memory?

In this context, the "device memory pool" is memory managed by the OpenACC runtime. Some of this memory may be free within the pool and available for the runtime to reuse, but nvidia-smi sees it as allocated (used) memory. Free GPU memory is what's left over, i.e. not used by the pool or by other allocated components such as the CUDA context.
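
If it helps to see this from inside the program rather than from nvidia-smi, one option (a rough sketch with a hypothetical helper name, relying on CUDA Fortran since you already compile with -cuda) is to query the driver's view of free memory around your data regions:

! Rough sketch: query the driver's view of free/total device memory.
! Memory held by the OpenACC pool allocator counts as "used" here even
! if the runtime considers it free for reuse.
subroutine report_device_memory(tag)
    use cudafor
    implicit none
    character(*), intent(in) :: tag
    integer(kind=cuda_count_kind) :: free_bytes, total_bytes
    integer :: istat

    istat = cudaMemGetInfo(free_bytes, total_bytes)
    if (istat /= cudaSuccess) print *, 'cudaMemGetInfo failed: ', istat
    print '(a,a,f8.2,a,f8.2,a)', tag, ': free ', real(free_bytes)/2.0**30, &
          ' GB of ', real(total_bytes)/2.0**30, ' GB'
end subroutine report_device_memory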

  2. Based on the symptoms, it looks like some resource is being filled up during the fresh run on node2. After it fills up completely, the performance drops. What resource could that be, and how can we free it up or reset it without restarting the node?

I’ve not seen this behavior before, so I’m not sure. It’s possibly a timing issue due to the way you’ve set up the async queues (aka CUDA streams). More on this later.

  3. What is the expected behavior of the finalize clause in the context of our code and compilation flags? Can there be some conflicts with the copyout being async or the use of pinned memory?

The only thing finalize does is set the reference counter in the present table to zero. It’s extraneous here, though, since the reference count would go to zero anyway because you have matching copyin and copyout data directives.
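
To illustrate the reference counting (a contrived sketch with a hypothetical array "a", not your code):

! Contrived sketch of the dynamic reference count that "finalize" acts on
!$acc enter data copyin(a)            ! reference count of "a": 0 -> 1
!$acc enter data copyin(a)            ! count: 1 -> 2 (e.g. from a called routine)

!$acc exit data copyout(a)            ! count: 2 -> 1; nothing is copied or freed yet
!$acc exit data copyout(a) finalize   ! count forced to 0; "a" is copied back and freed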

First, all data transfers must use pinned host memory. By default we use a double-buffering system where host virtual memory is copied to a pinned memory buffer and then copied asynchronously to the device. While the first buffer is being transferred, the second buffer is filled. This can effectively hide much of the virtual-to-pinned memory copy time, depending on the amount of memory being transferred.

With the “-gpu=pinned” flag, instead of using the buffers, the host data is allocated directly in pinned physical host memory. The overhead to allocate this memory is much higher, and you are limited by the amount of physical memory on the host. For most programs, using “pinned” is not beneficial. The cases where it does help are when there are few host allocations but many copies of that memory between the host and device, which does appear to be the case here.

“async” does two things. It tells the OpenACC runtime not to block but to continue running host code, which it does in both cases (pinned vs. buffered). It also assigns the underlying cudaMemcpyAsync calls to a CUDA stream when an async queue id is given, or to the default queue when “async” is used without an id.

The bottom line is that, no, I don’t think this would cause a conflict.

  4. What can we do to investigate this problem further?

Let’s look at the first problem of running out of GPU memory when the pool allocator is used.

The primary reason we use the pool allocator is to save the cost of reallocating memory, especially when the program repeatedly allocates blocks of the same size. That is the case in your program, so I would expect this optimization to be beneficial for you.

However, when used with “async”, the pool allocator can’t free this memory until it reaches a “wait”, since it doesn’t know when the data has finished copying. Because your program has the “wait” in the middle of the loop, it can’t free queue #3’s memory for one iteration until after it has allocated the next iteration’s memory. This is likely causing the out-of-memory issues.
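
Schematically, with a hypothetical array "buf" standing in for your holders:

do iz = 1, iz_end
    !$acc enter data copyin(buf(:,iz)) async(2)    ! allocates from the pool
    ! ... kernels ...
    !$acc exit data copyout(buf(:,iz)) async(3)    ! queued, but nothing is released yet
    ! without a wait here, the next iteration's copyin grabs new pool memory
    ! before this iteration's copyout memory can be returned to the pool
enddo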

While I don’t have a complete view of your code, I believe how you’re using “async” is incorrect, or at least ineffective.

Your code has three loops: section A, where “p5F” is computed; section B, for “p5B”; and section C, which does some numerical operations. A and B appear to be independent and could run concurrently on the device. You don’t show any code for C, so I’m assuming C depends on A and B completing before it starts.

The async queue ids create a dependency graph. Items on the same queue id run in sequence. Items on different queues can run independently of each other.

Here, though, you’ve put the compute regions on queue #1 and the data transfers on queues #2 and #3. This causes two problems: a data copy may not be complete before the data is used in a compute region, and the independent sections run in sequence when they could run concurrently. Plus, your waits are misplaced.
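
As a minimal illustration with a hypothetical array "a" (not your code): a copy queued on one queue is only guaranteed to complete before a kernel on another queue if that dependency is expressed, for example with a wait clause on the compute construct:

!$acc enter data copyin(a) async(2)

! "wait(2)" makes the kernel on queue 1 wait for the copy on queue 2,
! without blocking the host
!$acc parallel loop present(a) async(1) wait(2)
do i = 1, n
    a(i) = 2.0*a(i)
enddo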

I should note that the compute regions may not run fully concurrently. How much overlap you get depends on how much of the GPU each uses. If the first compute region uses all of the GPU, the second won’t start until the first begins freeing up resources. You can see how much the computation overlaps by viewing the timeline of an Nsight Systems profile in the GUI.

What I propose you try is something like the following:

!$acc enter data copyin(global_data1,global_data2)
<some CPU operations>

do iz = 1, iz_end

! put the dependent "p5F" data on the same queue as Section A
    !$acc enter data copyin(p5F_float_holder(:,:,:,:,iz)) async(1)

! Start copying the "p5B" data for section B here on queue 2 
   !$acc enter data copyin(p5B_float_holder(:,:,:,:,iz)) async(2)

! Launch Section A on queue 1
    !$acc parallel loop gang collapse(2) private(p3_end) default(present) present(p5F_float_holder(:,:,:,:,iz),global_data1,global_data2) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            p3_end = global_data1(p2,p1)

            !$acc loop worker vector private(p5_float) collapse(2)
            do p3 = 1, p3_end
                do p4 = 1, p4_end
        !--- some computation ---!
                    p5F_float_holder(p4,p3,p2,p1,iz) = p5_float
                enddo
            enddo
        enddo
    enddo

    ! delete the "wait"
   
! Start Section B on a different queue
    !$acc parallel loop gang collapse(2) private(p3_end) default(present) present(p5B_float_holder(:,:,:,:,iz),global_data1,global_data2) async(2)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            p3_end = global_data1(p2,p1)

            !$acc loop worker vector private(p5_float) collapse(2)
            do p3 = 1, p3_end
                do p4 = 1, p4_end
        !--- some computation ---!
                    p5B_float_holder(p4,p3,p2,p1,iz) = p5_float
                enddo
            enddo
        enddo
    enddo


! Assumption is that section C depends on A and B completing before it starts.
! Put it on its own queue, but wait on A and B
    !$acc parallel loop collapse(4) default(present) async(3) wait(1,2)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p6 = 1, p6_end
                do p4 = 1, p4_end
                    <a sequence of numerical operations>
                enddo
            enddo
        enddo
    enddo

! Assumption is that section C does not modify p5F and p5B, so they can be copied back
! while Section C is running on the device.  Please let me know if this is incorrect.
! While C is running, start copying back p5F and p5B on their own queues.

    !$acc exit data copyout(p5F_float_holder(:,:,:,:,iz)) async(1)
    !$acc exit data copyout(p5B_float_holder(:,:,:,:,iz)) async(2)

! Add the final "wait" to have the host block until all queues have completed,
! i.e. Section C has finished and the data has been copied back to the host.
! Since there are no remaining references to the data, the OpenACC runtime should release
! the memory back to the pool allocator for reuse in the next iteration of the outer loop.
    !$acc wait
 
enddo

Again, I don’t have full insight into your program, so my assumptions may be off. Please adjust as needed.

Hopefully this fixes the out-of-memory issues and possibly the node2 performance issue.

-Mat

Dear Mat,

Thank you for a prompt and detailed reply! I learned a few new things, such as the ability to list several queues in the wait() clause (e.g. wait(1,2)) and to put wait clauses directly on the parallel loop directives. I would be grateful if you could point me to some materials discussing such advanced OpenACC syntax.

I tried implementing your suggestion, and it made our execution a bit faster (not drastically, but a measurable few percent). I think you understood our code structure quite well; however, I omitted a few key things from the pseudo-code that I thought were not relevant. So, before I discuss our findings further, let me provide a fuller version of our original pseudo-code:

!$acc enter data copyin(global_data1,global_data2)
<some CPU operations>

do iz = 1, iz_end

    !$acc enter data copyin(p5F_float_holder(:,:,:,:,iz),p5F_start(:,:,:,:,iz)) async(2)

    ! section 1 - can run independently from others
    !$acc parallel loop collapse(4) default(present) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p6 = 1, p6_end
                do p4 = 1, p4_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! section 2 - requires completion of section 1
    !$acc parallel loop collapse(4) default(present) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p6 = 1, p6_end
                do p4 = 1, p4_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! section 3 (ex-Section A) - requires completion of section 2
    !$acc parallel loop gang collapse(2) private(p3_end) default(present) present(p5F_float_holder(:,:,:,:,iz), &
    !$acc& p5F_start(:,:,:,:,iz),global_data1,global_data2) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            p3_end = global_data1(p2,p1)

            !$acc loop worker vector private(p5_float) collapse(2)
            do p3 = 1, p3_end
                do p4 = 1, p4_end
                    p5_float = p5F_start(p4,p3,p2,p1,iz)
                    p5_end = global_data2(p3,p1)
                    !$acc loop seq
                    do p5 = 1, p5_end

                        <a sequence of numerical operations>
                        <result> = <a sequence of numerical operations>
                        p5_float = p5_float + <result>

                    enddo
                    p5F_float_holder(p4,p3,p2,p1,iz) = p5_float
                enddo
            enddo
        enddo
    enddo

    !$acc enter data copyin(p5B_float_holder(:,:,:,:,iz),p5B_start(:,:,:,:,iz)) async(2)
    !$acc wait

    ! section 4 (ex-Section B) - requires completion of section 2
    !$acc parallel loop gang collapse(2) private(p3_end) default(present) present(p5B_float_holder(:,:,:,:,iz),&
    !$acc& p5B_start(:,:,:,:,iz),global_data1,global_data2) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            p3_end = global_data1(p2,p1)

            !$acc loop worker vector private(p5_float) collapse(2)
            do p3 = 1, p3_end
                do p4 = 1, p4_end
                    p5_float = p5B_start(p4,p3,p2,p1,iz)
                    p5_end = global_data2(p3,p1)
                    !$acc loop seq
                    do p5 = p5_end,1,-1

                        <a sequence of numerical operations>
                        <result> = <a sequence of numerical operations>
                        p5_float = p5_float + <result>

                    enddo
                    p5B_float_holder(p4,p3,p2,p1,iz) = p5_float
                enddo
            enddo
        enddo
    enddo

    !$acc exit data copyout(p5F_float_holder(:,:,:,:,iz)) delete(p5F_start(:,:,:,:,iz)) async(3) finalize
    !$acc wait(1)

    ! section 5 (ex-Section C) - requires completion of sections 3 and 4
    !$acc parallel loop collapse(4) default(present) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p6 = 1, p6_end
                do p4 = 1, p4_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    !$acc exit data copyout(p5B_float_holder(:,:,:,:,iz)) delete(p5B_start(:,:,:,:,iz)) async(3) finalize

    ! small section 6 - requires completion of sections 3 and 4, independent from section 5
    !$acc parallel loop collapse(4) default(present) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p7 = 1, p7_end
                do p8 = 1, p8_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! small section 7 - requires completion of section 6
    !$acc parallel loop collapse(4) default(present) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p7 = 1, p7_end
                do p8 = 1, p8_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! small section 8 - requires completion of section 5
    !$acc parallel loop collapse(4) default(present) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p9 = 1, p9_end
                do p10 = 1, p10_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! small section 9 - requires completion of section 7
    !$acc parallel loop collapse(4) default(present) async(1)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p9 = 1, p9_end
                do p10 = 1, p10_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

enddo

As you can see, I renamed the Section A, B, and C parts into the numbered sections listed in order. So we have a few relatively small loops before the heavyweight sections 3, 4, and 5; this is where we wanted to overlap the copyin, hiding it under the kernel executions. Following section 5, we have a few more, even smaller, loops that contribute negligibly to the execution time. In addition, instead of having one array for both reading and writing the data in sections 3 and 4, we actually have two separate arrays, as shown in the code. I profiled the code with Nsight Systems on a small problem and am providing a screenshot for your convenience:
[screenshot: Nsight Systems timeline for the small problem]
As shown in the screenshot, because there is no wait statement at the end of the iz loop, we were able to overlap the copyin of the data with the previous step, which gave a significant performance benefit. For the larger problems that are our primary target, the kernel execution blocks are noticeably longer than the data transfers, so it works even better. I do not have a fresh profiling example, as I recently started getting errors when generating Nsight Systems reports for large problems, but here is an older one just to illustrate the difference:
[screenshot: older Nsight Systems timeline for a larger problem]
Following your suggestion, I converted the code to the following pseudo-code format:

!$acc enter data copyin(global_data1,global_data2)
<some CPU operations>

do iz = 1, iz_end

    ! put the dependent "p5F" data on the same queue as Section A
    !$acc enter data copyin(p5F_float_holder(:,:,:,:,iz),p5F_start(:,:,:,:,iz)) async(1)

    ! Start copying the "p5B" data for section B here on queue 2 
   !$acc enter data copyin(p5B_float_holder(:,:,:,:,iz),p5B_start(:,:,:,:,iz)) async(2)

    ! section 1 - can run independently from others, overlaps with the copyin
    !$acc parallel loop collapse(4) default(present) async(3)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p6 = 1, p6_end
                do p4 = 1, p4_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! section 2 - requires completion of section 1, overlaps with the copyin
    !$acc parallel loop collapse(4) default(present) async(3)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p6 = 1, p6_end
                do p4 = 1, p4_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! section 3 (ex-Section A) - requires completion of section 2
    !$acc parallel loop gang collapse(2) private(p3_end) default(present) present(p5F_float_holder(:,:,:,:,iz), &
    !$acc& p5F_start(:,:,:,:,iz),global_data1,global_data2) async(1) wait(3)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            p3_end = global_data1(p2,p1)

            !$acc loop worker vector private(p5_float) collapse(2)
            do p3 = 1, p3_end
                do p4 = 1, p4_end
                    p5_float = p5F_start(p4,p3,p2,p1,iz)
                    p5_end = global_data2(p3,p1)
                    !$acc loop seq
                    do p5 = 1, p5_end

                        <a sequence of numerical operations>
                        <result> = <a sequence of numerical operations>
                        p5_float = p5_float + <result>

                    enddo
                    p5F_float_holder(p4,p3,p2,p1,iz) = p5_float
                enddo
            enddo
        enddo
    enddo

    ! section 4 (ex-Section B) - requires completion of section 2
    !$acc parallel loop gang collapse(2) private(p3_end) default(present) present(p5B_float_holder(:,:,:,:,iz), &
    !$acc& p5B_start(:,:,:,:,iz),global_data1,global_data2) async(2) wait(3)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            p3_end = global_data1(p2,p1)

            !$acc loop worker vector private(p5_float) collapse(2)
            do p3 = 1, p3_end
                do p4 = 1, p4_end
                    p5_float = p5B_start(p4,p3,p2,p1,iz)
                    p5_end = global_data2(p3,p1)
                    !$acc loop seq
                    do p5 = p5_end,1,-1

                        <a sequence of numerical operations>
                        <result> = <a sequence of numerical operations>
                        p5_float = p5_float + <result>

                    enddo
                    p5B_float_holder(p4,p3,p2,p1,iz) = p5_float
                enddo
            enddo
        enddo
    enddo

    ! section 5 (ex-Section C) - requires completion of sections 3 and 4
    !$acc parallel loop collapse(4) default(present) async(3) wait(1,2)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p6 = 1, p6_end
                do p4 = 1, p4_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! While section 5 is running, start copying back p5F and p5B on their queues.
    !$acc exit data copyout(p5F_float_holder(:,:,:,:,iz)) delete(p5F_start(:,:,:,:,iz)) async(1) 
    !$acc exit data copyout(p5B_float_holder(:,:,:,:,iz)) delete(p5B_start(:,:,:,:,iz)) async(2)

    ! small section 6 - requires completion of sections 3 and 4, independent from section 5
    !$acc parallel loop collapse(4) default(present) async(2)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p7 = 1, p7_end
                do p8 = 1, p8_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! small section 7 - requires completion of section 6
    !$acc parallel loop collapse(4) default(present) async(2)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p7 = 1, p7_end
                do p8 = 1, p8_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! small section 9 - requires completion of section 7
    !$acc parallel loop collapse(4) default(present) async(2) 
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p9 = 1, p9_end
                do p10 = 1, p10_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! small section 8 - requires completion of section 5
    !$acc parallel loop collapse(4) default(present) async(1) wait(3)
    do p1 = 1, p1_end
        do p2 = 1, p2_end
            do p9 = 1, p9_end
                do p10 = 1, p10_end

                    <a sequence of numerical operations>

                enddo
            enddo
        enddo
    enddo

    ! the final wait to let the OpenACC runtime release the memory back to the 
    ! pool allocator for use in the next iteration of the outer loop.
    !$acc wait

enddo

The profiling screenshot for the new code is below:
[screenshot: Nsight Systems timeline for the revised code]
However, despite the !$acc wait at the end, the occupied memory after the iz loop was just as high as in our original code without the NV_ACC_MEM_MANAGE flag. The tests were performed on node1, which has had no problems with memory freeing. I also tested with and without the flag and found no performance difference at all; the only difference was the increased memory occupancy.

So I hope you could help us further by answering the following questions:

  1. From what I understood, using the finalize clause in our code is not needed because we have matching copyin and copyout clauses. But could you provide an example where it is actually needed?
  2. Could you please have a look at the updated pseudo-code again and point out any apparent flaws that could lead to the memory overflow when not using the NV_ACC_MEM_MANAGE flag or that could penalize the performance?
  3. As our loops have many levels, would there be any benefit from converting them and all data arrays to 1D format and using indexing for navigation? Or does OpenACC collapse the loops efficiently enough?
  4. As a side question: since we have the p5F_float_holder and p5F_start data arrays, we are considering copying the content of p5F_start into p5F_float_holder prior to running the GPU part, thus reducing the time for copying the data and potentially the GPU memory occupancy. But is there a way to perform such a host-to-host copy within the iz loop asynchronously to the GPU execution so that we could hide the delay? Both arrays are pinned.
  5. If you have any other ideas how we could solve the out-of-memory problem when not using the NV_ACC_MEM_MANAGE flag, please let us know what we should test or check.

Thank you so much for your help!

I’m not sure there’s anything out there that discusses using the wait clause and dependency graphs. Jeff Larkin does cover the wait clause a bit in Chapter 10 of “OpenACC for Programmers”, though in no more depth than what I wrote above.

Long ago, when I did full-day trainings, I’d include this, but most current trainings have had to be shortened. Async and the wait directive are still covered, since those are widely used, but not at this level of detail.

  1. From what I understood, using the finalize clause in our code is not needed because we have matching copyin and copyout clauses. But could you provide an example where it is actually needed?

Sorry, I don’t have anything. The only place I can think of where it’s needed is in some type of clean-up or exception-handling code, when you don’t know how many nested data regions a variable was used in.

  2. Could you please have a look at the updated pseudo-code again and point out any apparent flaws that could lead to the memory overflow when not using the NV_ACC_MEM_MANAGE flag or that could penalize the performance?

I’m not seeing anything that jumps out at me. Can you try setting the environment variable “NV_ACC_SYNCHRONOUS=1”? This disables async so we can see if the issue has to do with async or something else.

  3. As our loops have many levels, would there be any benefit from converting them and all data arrays to 1D format and using indexing for navigation? Or does OpenACC collapse the loops efficiently enough?

Collapse should be fine. You have the stride-1 dimension corresponding to the innermost vector loop, so the memory accesses should be coalesced. If this were C/C++ using pointers to pointers, it might be worthwhile, but not in Fortran.

  4. As a side question: since we have the p5F_float_holder and p5F_start data arrays, we are considering copying the content of p5F_start into p5F_float_holder prior to running the GPU part, thus reducing the time for copying the data and potentially the GPU memory occupancy. But is there a way to perform such a host-to-host copy within the iz loop asynchronously to the GPU execution so that we could hide the delay? Both arrays are pinned.

In other words, you want the copy operation to be non-blocking on the host? To do that you’d need to fork off a host thread to handle the operation, using something like a deferred OpenMP task.
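
Something along these lines could work as a rough, untested sketch; the num_threads(2)/single structure is just illustrative and would need to coexist with your existing OpenMP parallelism:

! Rough sketch: a second host thread fills the next iz slice of p5F_float_holder
! from p5F_start while this thread keeps queueing the GPU work for the current slice.
!$omp parallel num_threads(2)
!$omp single
p5F_float_holder(:,:,:,:,1) = p5F_start(:,:,:,:,1)       ! fill the first slice up front
do iz = 1, iz_end

    ! deferred task: an idle thread copies the next slice on the host
    if (iz < iz_end) then
        !$omp task firstprivate(iz)
        p5F_float_holder(:,:,:,:,iz+1) = p5F_start(:,:,:,:,iz+1)
        !$omp end task
    endif

    ! ... OpenACC copyin / kernels / copyout for slice iz, as before ...

    ! the next slice must be filled before the next iteration's copyin uses it
    !$omp taskwait
enddo
!$omp end single
!$omp end parallel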

Another possibility is to do a device memcpy using p5F_start’s host copy as the source and the device copy of p5F_float_holder as the destination. You’d need to put “p5F_float_holder” in an “enter data create” beforehand so the device copy exists, then call “acc_memcpy_to_device”. However, this is blocking. For a non-blocking copy you’d need to add a bit of CUDA Fortran and use “cudaMemcpyAsync”.

Note that the OpenACC spec only defines “acc_memcpy_to_device” as a C/C++ routine, but we have an interface for it in the OpenACC module, i.e. “use openacc”.

For CUDA Fortran add “use cudafor” and add the “-cuda” flag to the compile and link flags.
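
Here’s a rough, untested sketch of the CUDA Fortran route, using simplified 1-D arrays "hostsrc" and "devdst" (hypothetical names) in place of your 5-D slices:

! Rough sketch: copy a pinned host array straight into the device copy of an
! OpenACC-managed array, without staging through that array's host copy.
subroutine async_h2d_sketch(n)
    use cudafor
    use openacc
    implicit none
    integer, intent(in) :: n
    real, pinned, allocatable :: hostsrc(:)   ! pinned host source (like p5F_start)
    real, allocatable :: devdst(:)            ! gets an OpenACC device copy (like p5F_float_holder)
    integer(kind=cuda_stream_kind) :: copystream
    integer :: istat

    allocate(hostsrc(n), devdst(n))
    hostsrc = 1.0
    istat = cudaStreamCreate(copystream)

    ! the device copy of devdst must exist before the memcpy
    !$acc enter data create(devdst)

    ! host_data exposes the device address of devdst to the CUDA API, so this is a
    ! non-blocking host-to-device copy from pinned hostsrc into the device copy of devdst
    !$acc host_data use_device(devdst)
    istat = cudaMemcpyAsync(devdst, hostsrc, n, cudaMemcpyHostToDevice, copystream)
    !$acc end host_data

    istat = cudaStreamSynchronize(copystream)   ! before any kernel reads devdst

    !$acc exit data delete(devdst)
    istat = cudaStreamDestroy(copystream)
    deallocate(hostsrc, devdst)
end subroutine async_h2d_sketch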

  5. If you have any other ideas how we could solve the out-of-memory problem when not using the NV_ACC_MEM_MANAGE flag, please let us know what we should test or check.

Unfortunately I don’t, other than confirming whether it’s due to async (see above). If it is, I’ll need to do some experimenting and talk with engineering for ideas.

Dear Mat,

Thank you for your help and replies. We will try to implement the non-blocking host-side copy following your suggestion.

I was able to fix the memory occupancy problem by adjusting the GPU code that runs prior to the pseudo-code construct presented here. So both our original code and the modified version following your suggestions now occupy the same amount of GPU memory as when using the NV_ACC_MEM_MANAGE flag.