OpenMP + OpenACC model

Hello,

I have a program that uses OpenACC to offload work to the GPU.

In my code, I created an OpenACC device function that is called inside a parallel loop:

#pragma acc parallel loop deviceptr(A, B, C)
{
    func(A, B, C);
}

The data have already been copied to the device, and the three device pointers A, B, and C were obtained with the acc_deviceptr() runtime routine.
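
Roughly like this (a sketch; h_A and N here are placeholder names for one of the host arrays and its size):

#include <openacc.h>

float *h_A = (float*) malloc(sizeof(float) * N);
// ... initialize h_A on the host ...
acc_copyin(h_A, sizeof(float) * N);       // allocate device memory and copy
float *A = (float*) acc_deviceptr(h_A);   // look up the matching device pointer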

For some reason, I would like to use OpenMP for host multithreading, with each thread launching its own OpenACC kernel:

#pragma omp parallel num_threads()
{
    size_t tid = omp_get_thread_num();

    #pragma omp for
    for (int i = 0; i < n; ++i) {
        #pragma acc parallel loop deviceptr(A, B, C)
        {
            func(A, B, C);
        }
    }
}

My question is: does PGI 17 support this kind of usage? If not, is there another solution, such as the following?

#pragma acc parallel loop private(i)         // run on CPU
for (int i = 0; i < n; ++i)
{
    #pragma acc loop deviceptr(A, B, C)      // run on GPU
    {
        func(A, B, C);
    }
}

All of the code is written with OpenACC directives, but the outer parallel region targets CPU multithreading while the inner loop targets a GPU kernel.

Sincerely,
Tao

Hi Tao,

Yes, this should work fine with the PGI 2017 compilers.

Are there dependencies on A, B, or C between the OpenMP threads? i.e., are the OpenMP threads working on different A, B, and C arrays?

If not, then I'd recommend adding an async clause to the OpenACC parallel construct so the kernels run concurrently on the GPU.

Also, since there’s no loop around “func”, remove the “loop” clause and be sure to make “func” a “routine gang”.

Something like:

#pragma acc routine gang 
void func(float *, float *, float *);
...

#pragma omp parallel num_threads()
{
    size_t tid = omp_get_thread_num();

    #pragma omp for
    for (int i = 0; i < n; ++i) {
        #pragma acc parallel deviceptr(A, B, C) async(tid)
        {
            func(A, B, C);
        }
    }
    #pragma acc wait
}
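
For the routine itself, the definition would be along these lines (a sketch; I'm assuming "func" contains a loop over the arrays, and I've added an element count parameter "len" just for illustration):

#pragma acc routine gang
void func(float *a, float *b, float *c, int len)
{
    // gang/vector parallel loop inside the device routine
    #pragma acc loop gang vector
    for (int j = 0; j < len; ++j) {
        c[j] = a[j] + b[j];   // placeholder computation
    }
}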

If you want to use multiple GPUs, be sure to set the device number for each OpenMP thread and make sure the device data resides on the correct device.
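
For example, using the OpenACC runtime API (a sketch with one OpenMP thread per GPU; acc_device_nvidia is the device type for NVIDIA GPUs):

#include <openacc.h>
...
int ngpus = acc_get_num_devices(acc_device_nvidia);

#pragma omp parallel num_threads(ngpus)
{
    int tid = omp_get_thread_num();
    // bind this OpenMP thread to its own GPU; any enter data or
    // update directives must also be issued on this device
    acc_set_device_num(tid, acc_device_nvidia);
    ...
}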

Finally, is there a reason why you're using device pointers for A, B, and C instead of using data regions? It's fine to do so, but it does make your program less portable.

-Mat

Hi Mat,

Thanks for your reply.

Are there dependencies on A, B, or C between the OpenMP threads? i.e., are the OpenMP threads working on different A, B, and C arrays?

No, there are no dependencies on these arrays between the OpenMP threads.
I'll try adding an async clause to the loop.

As for your question:

Finally, is there a reason why you’re using device pointers for A, B, and C instead of using data regions?

In my code, A, B, and C are very large arrays, and the kernel function is called over many iterations, so using a data construct to transfer the data between host and device on every call would be very inefficient.

Maybe I can use a data region with present(host pointers), as follows:

#pragma omp parallel num_threads()
{
    size_t tid = omp_get_thread_num();

    #pragma omp for
    for (int i = 0; i < n; ++i) {
        #pragma acc data present(h_A, h_B, h_C)
        {
            #pragma acc parallel loop
            {
                func(h_A, h_B, h_C);
            }
        }
    }
}

But I don't see what performance difference there would be between using device pointers and data regions.

-Tao

In my code, A, B, and C are very large arrays, and the kernel function is called over many iterations, so using a data construct to transfer the data between host and device on every call would be very inefficient.

Granted, I don’t know your program, but I doubt using data regions higher in your code would be less efficient than using device pointers. Have you investigated using unstructured data regions?


Try something like:

float *A = (float*) malloc(sizeof(float) * N);
...
#pragma acc enter data create(A[0:N], B[0:N], C[0:N])
...
// update the device data
#pragma acc update device(A[0:N])
...

#pragma omp parallel num_threads()
{
    size_t tid = omp_get_thread_num();

    #pragma omp for
    for (int i = 0; i < n; ++i) {
        #pragma acc parallel present(A, B, C)
        {
            func(A, B, C);
        }
    }
}
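
And once the arrays are no longer needed on the device, the unstructured region is closed with a matching exit data:

...
#pragma acc exit data delete(A[0:N], B[0:N], C[0:N])
free(A);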

-Mat