OpenACC and OpenMP data interoperability

Hello everyone!

I was testing the interoperability between OpenACC and OpenMP. More specifically, I was trying to allocate and copy memory to the device using OpenACC and then use it in an OpenMP target region. Following some examples shown in an OpenMP workshop I participated in, I created the following example (compiled with NVHPC 21.5: nvc++ -mp=gpu -acc -cuda test.cpp):

#include <iostream>
#include <omp.h>
#include <openacc.h>
int main() {
    int N = 100000;
    int *x = new int[N];
    std::cout << "Number of devices : " << omp_get_num_devices() << std::endl;
    for (int i=0; i<N; ++i)
        x[i] = 1;
    #pragma acc data copy(x[0:N])
    {
        #pragma omp target loop
        for (int i=0; i<N; ++i)
            x[i] = 2;
    }
    for (int i=0; i<N; ++i)
        std::cout << x[i] << " ";
    std::cout << std::endl;
    delete [] x;
    return 0;
}

and observed the following behavior (output of nvprof --print-gpu-trace ./a.out):

==23910== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream        Unified Memory  Virtual Address  Name
513.74ms  36.128us                    -               -         -         -         -  390.63KB  10.311GB/s      Pinned      Device  Tesla V100-SXM2         1         7                     -                -  [CUDA memcpy HtoD]
534.08ms         -                    -               -         -         -         -         -           -           -           -                -         -         -         PC 0xeb76216f   0x7fff32000000  [Unified Memory CPU page faults]
534.41ms  1.3760us                    -               -         -         -         -        8B  5.5446MB/s    Pageable      Device  Tesla V100-SXM2         1         7                     -                -  [CUDA memcpy HtoD]
535.64ms  2.6240us            (782 1 1)       (128 1 1)        16        0B        0B         -           -           -           -  Tesla V100-SXM2         1        17                     -                -  nvkernel_main_F1L14_1 [515]
535.69ms  31.936us                    -               -         -         -         -  390.63KB  11.665GB/s      Device      Pinned  Tesla V100-SXM2         1         7                     -                -  [CUDA memcpy DtoH]

I have 2 questions regarding the above:

  1. Is there really only one copy done from the Host to the Device?
  2. Where is the Unified Memory CPU page fault coming from?

I’ve also created a slightly more complicated example where I again copy an array x to the GPU using acc_copyin and then associate another pointer with part of the copied array.

#include <iostream>
#include <omp.h>
#include <openacc.h>
int main() {
    int N = 10;
    int *x = new int[N];
    std::cout << "Number of devices : " << omp_get_num_devices() << std::endl;
    for(int i = 0; i < N; i++) {
        x[i] = i;
    }
    int *x_dev = (int *)acc_copyin(x, N*sizeof(int));  // copy x to the device and get back the device pointer
    int* y;
    // Associate y with the last 5 elements of the device copy of x
    // (size 5*sizeof(int), device offset (N-5)*sizeof(int), device 0).
    omp_target_associate_ptr(y, x_dev, 5*sizeof(int), (N-5)*sizeof(int), 0);
    #pragma omp target loop map(x[0:N],y[0:5])
    for (int i=0; i<N; ++i) {
        x[i] = 1;
        if(i<5) {
            y[i] += 1;
        }
    }
    #pragma acc exit data copyout(x[0:N])
    for (int i=0; i<N; ++i)
        std::cout << x[i] << " ";
    std::cout << std::endl;
    delete [] x;
    return 0;
}

The output of nvprof is:

srun -n 1 nvprof --print-gpu-trace ./a.out
==171271== NVPROF is profiling process 171271, command: ./a.out
Number of devices : 1
1 1 1 1 1 2 2 2 2 2
==171271== Profiling application: ./a.out
==171271== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream        Unified Memory  Virtual Address  Name
501.89ms  1.7280us                    -               -         -         -         -       40B  22.076MB/s      Pinned      Device  Tesla V100-SXM2         1         7                     -                -  [CUDA memcpy HtoD]
522.24ms         -                    -               -         -         -         -         -           -           -           -                -         -         -         PC 0xac26816f   0x2aaaf0000000  [Unified Memory CPU page faults]
522.57ms  1.3760us                    -               -         -         -         -        8B  5.5446MB/s    Pageable      Device  Tesla V100-SXM2         1         7                     -                -  [CUDA memcpy HtoD]
523.80ms  1.7280us              (1 1 1)        (32 1 1)        16        0B        0B         -           -           -           -  Tesla V100-SXM2         1        14                     -                -  nvkernel_main_F1L20_1 [157]
523.87ms  2.4320us                    -               -         -         -         -       40B  15.685MB/s      Device      Pinned  Tesla V100-SXM2         1         7                     -                -  [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

My questions in this example are:

  1. Is there again really only one HtoD copy of the x array?
  2. Does OpenMP figure out automatically that the pointer x is associated with the x_dev pointer and is already present in GPU memory, using the OpenACC present table?
  3. Have I understood correctly that this is the proper usage and benefit of omp_target_associate_ptr, meaning that it’s used to associate another pointer with the same data already existing on the GPU? It also seems to me that this is not needed for the x array pointer. Am I right?

Thank you very much in advance for your input. It would be very helpful for me to understand how the data transfers are handled by OpenACC and OpenMP so that I can apply changes to existing code.

Hi iomagkanaris,

I don’t believe we claim interoperability between OpenACC and OpenMP target offload to GPUs, but the two models do share some of the runtime, in particular data management, so this aspect should work OK. Though we haven’t thoroughly tested it, so there may be issues we’re unaware of. Normally I’d recommend sticking to one model or the other, but it sounds like you want to port an existing code from OpenACC to OpenMP and do it incrementally.

  1. Is there really only one copy done from the Host to the Device?

It appears so. The models share the same runtime data management so the device copy of “x” would be visible when the compiler does the present check upon entering the compute region.
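
If you want to double-check this from the application side, you can ask both runtimes whether the host pointer is present. Something like the following minimal, untested sketch (assuming omp_target_is_present from OpenMP 5.0 is available in your omp.h):

#include <iostream>
#include <omp.h>
#include <openacc.h>
int main() {
    int N = 100000;
    int *x = new int[N];
    #pragma acc enter data copyin(x[0:N])
    // If the runtimes share the present table, both queries should
    // report the array as present on the device.
    std::cout << "acc_is_present:        " << acc_is_present(x, N*sizeof(int)) << std::endl;
    std::cout << "omp_target_is_present: " << omp_target_is_present(x, omp_get_default_device()) << std::endl;
    #pragma acc exit data delete(x[0:N])
    delete [] x;
    return 0;
}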

  2. Where is the Unified Memory CPU page fault coming from?

Sorry, no idea. I don’t see it when I profile the code, but I’m using Nsight Systems, which doesn’t have the print-gpu-trace option. Possibly an artifact of the profiling?

  1. Is there again really only one HtoD copy of the x array?

Since x and y both point to the same device memory, the present check will pass in the same device pointer for both. The copies only occur when you call “acc_copyin” and “exit data copyout”.
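
For comparison, the same data movement expressed purely in OpenMP would look something like the sketch below (untested, just to show where the single HtoD and DtoH copies come from in that model):

#include <iostream>
#include <omp.h>
int main() {
    int N = 10;
    int *x = new int[N];
    for (int i = 0; i < N; ++i)
        x[i] = i;
    // Single HtoD copy, analogous to acc_copyin(x, N*sizeof(int)).
    #pragma omp target enter data map(to: x[0:N])
    // The map clause here only triggers a present check; no extra copy.
    #pragma omp target loop map(x[0:N])
    for (int i = 0; i < N; ++i)
        x[i] = 1;
    // Single DtoH copy, analogous to "#pragma acc exit data copyout(x[0:N])".
    #pragma omp target exit data map(from: x[0:N])
    for (int i = 0; i < N; ++i)
        std::cout << x[i] << " ";
    std::cout << std::endl;
    delete [] x;
    return 0;
}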

  2. Does OpenMP figure out automatically that the pointer x is associated with the x_dev pointer and is already present in GPU memory, using the OpenACC present table?

They share the same present table, so it should work as expected in this case.

  3. Have I understood correctly that this is the proper usage and benefit of omp_target_associate_ptr, meaning that it’s used to associate another pointer with the same data already existing on the GPU? It also seems to me that this is not needed for the x array pointer. Am I right?

I wouldn’t necessarily recommend mapping two host pointers to the same device address in the same kernel, as you do here, since it has the potential to introduce bugs, but it is a valid use case. The typical use case is to re-use device memory: create some device memory, map it to a host pointer, use it in a kernel, then map it to a different host pointer for another kernel, thus re-using the device memory.

No, it’s not needed for “x” since this is already implicitly mapped as part of the acc_copyin call.
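
For reference, the re-use pattern I mentioned above might look something like this rough, untested sketch using the standard OpenMP device memory routines (omp_target_alloc, omp_target_associate_ptr, omp_target_disassociate_ptr, omp_target_free):

#include <iostream>
#include <omp.h>
int main() {
    int N = 10;
    int dev = omp_get_default_device();
    int *a = new int[N];
    int *b = new int[N];
    // One device buffer, re-used for two different host arrays.
    int *dev_buf = (int *)omp_target_alloc(N*sizeof(int), dev);

    // Map "a" onto the device buffer and run a kernel on it.
    omp_target_associate_ptr(a, dev_buf, N*sizeof(int), 0, dev);
    #pragma omp target loop map(a[0:N])
    for (int i = 0; i < N; ++i)
        a[i] = 1;
    // Copy the results back before handing the buffer to "b".
    #pragma omp target update from(a[0:N])
    omp_target_disassociate_ptr(a, dev);

    // Re-map the same device buffer to "b" for a second kernel.
    omp_target_associate_ptr(b, dev_buf, N*sizeof(int), 0, dev);
    #pragma omp target loop map(b[0:N])
    for (int i = 0; i < N; ++i)
        b[i] = 2;
    #pragma omp target update from(b[0:N])
    omp_target_disassociate_ptr(b, dev);

    omp_target_free(dev_buf, dev);
    std::cout << a[0] << " " << b[0] << std::endl;  // expect "1 2"
    delete [] a;
    delete [] b;
    return 0;
}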

-Mat