present clause after copyin clause

Hello,
I have three questions:

  1. present clause after copyin clause:

I tried the OpenACC example from your PGI article, with a small modification: I used the present clause in the callee function, because the main function had already copied the data into global memory.

In main function:

     #pragma acc data copyin(x[0:n]) copy(y[0:n])
      {
      gettimeofday( &t0, NULL );
      saxpy( n, a, x, y );
      gettimeofday( &t1, NULL );
      }

Original callee code:

       void saxpy( int n, float a, float x[], float y[] ){
          #pragma acc parallel loop pcopyin(x[0:n],a,n) pcopy(y[0:n])
           for( int i = 0; i < n; ++i )
               y[i] += a*x[i];
           #pragma acc wait
       }

Edited code with present clause instead of pcopyin and pcopy:

       void saxpy( int n, float a, float x[], float y[] ){
          #pragma acc parallel loop present(x[0:n],a,n) present(y[0:n])
           for( int i = 0; i < n; ++i )
               y[i] += a*x[i];
           #pragma acc wait
       }

The program compiled successfully, but when I executed it:

./saxpy_test_data
FATAL ERROR: data in PRESENT clause was not found on device 1: name=y
 file: <PATH>/openacc_pgc++/saxpy.cpp _Z5saxpyifPfS_ line:1

I thought we had already copied y into device memory in the main function. Why was y not available in device memory at runtime?

  2. How do I deallocate memory that is allocated by a copy/create clause?

  3. What is pgi_uacc_cuda_fill? I saw it when I profiled my application.

Thank you very much,
Regards,
Minh

Hi Minh,

I thought we had already copied y into device memory in the main function. Why was y not available in device memory at runtime?

You need to do a few more things. First, expand the outer data region so it encompasses all calls to saxpy (you're missing the first init call). Second, add an “update” directive to refresh the device copy of “y” after it is reset on the host. Third, remove “a” and “n” from the present clause, since they aren't present on the device. Finally, update the Makefile so that “pgc++ -acc” is used to compile the main program.

#pragma acc data copyin(x[0:n]) copy(y[0:n])
{
    saxpy( n, a, x, y );        // first time to initialize the system
    for( int i = 0; i < n; ++i ) y[i] = i*100;

#pragma acc update device(y[0:n])

    // now for real, time it
    gettimeofday( &t0, NULL );
    saxpy( n, a, x, y );
    gettimeofday( &t1, NULL );
}
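For reference, here is a minimal sketch that puts the pieces of the reply together, assuming an OpenACC 2.x compiler. saxpy lists only the arrays in present(); the scalars a and n appear in no data clause, and under the OpenACC 2.x default rules such scalars in a parallel construct are treated as firstprivate, so their values are passed to the kernel at launch. A single data region encloses both calls, so present() finds x and y each time. The driver name run_saxpy_twice is hypothetical and the timing calls are omitted. Built without -acc, the pragmas are ignored and the code runs serially on the host, which is handy for checking the arithmetic.

```cpp
// saxpy on data that the caller has already placed on the device.
// Only the arrays appear in present(); a and n are scalars and are not
// listed, so default rules apply to them (firstprivate under OpenACC 2.x).
void saxpy(int n, float a, float* x, float* y) {
    #pragma acc parallel loop present(x[0:n], y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
    // No async clause is used, so the region is synchronous
    // and no "#pragma acc wait" is needed here.
}

// Hypothetical driver mirroring the structure of the reply (timing omitted).
// One data region encloses every call to saxpy, so present() succeeds.
void run_saxpy_twice(int n, float a, float* x, float* y) {
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        saxpy(n, a, x, y);                 // first call: warm-up

        for (int i = 0; i < n; ++i)        // reset y on the host
            y[i] = i * 100.0f;
        #pragma acc update device(y[0:n])  // push the new host values to the device

        saxpy(n, a, x, y);                 // second (timed) call
    }   // copy(y) copies the device y back to the host when the region exits
}
```

After the call, y[i] holds i*100 + a*x[i] whether or not the code was built with -acc, because the update directive makes the device copy match the reset host values before the second call.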



  2. How do I deallocate memory that is allocated by a copy/create clause?

Device memory allocation and deallocation are handled automatically when a data or compute region is entered or exited.
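If you need explicit control over when device memory is freed, OpenACC 2.0 added unstructured data directives: "enter data" allocates device memory (copyin also copies the host values in), and "exit data" with copyout or delete frees it, independent of any lexical region. The sketch below assumes an OpenACC 2.0 or later compiler; the helper names device_alloc, saxpy_resident and device_free are hypothetical.

```cpp
// Hypothetical helpers showing explicit device-memory lifetime with
// OpenACC 2.0 unstructured data directives.

// Allocate device copies of x and y and copy their host values in.
void device_alloc(int n, float* x, float* y) {
    #pragma acc enter data copyin(x[0:n], y[0:n])
}

// y = a*x + y on data that is already resident on the device.
void saxpy_resident(int n, float a, float* x, float* y) {
    #pragma acc parallel loop present(x[0:n], y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// Copy y back to the host, then free the device copies of y and x.
void device_free(int n, float* x, float* y) {
    #pragma acc exit data copyout(y[0:n]) delete(x[0:n])
}
```

A caller would run device_alloc once, any number of saxpy_resident calls, then device_free. Between the first and last call the arrays stay present on the device no matter which function is executing, which is the situation the present clause expects.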

  3. What is pgi_uacc_cuda_fill? I saw it when I profiled my application.

It’s a PGI runtime routine that performs an optimized device-to-device data copy.

  • Mat

Hi Mat,
Thanks a lot for your detailed answer. I am still confused.

  1. We use “a” and “n” in this saxpy kernel. So when we remove them from the data clauses, how can the GPU compute the kernel?

       void saxpy( int n, float a, float x[], float y[] ){
          #pragma acc parallel loop present(x[0:n]) present(y[0:n])
           for( int i = 0; i < n; ++i )
               y[i] += a*x[i];
           #pragma acc wait
       }
  2. As I understand it, data in device memory is no longer accessible outside the data/compute region. So is pcopy only valid for variables allocated via acc_malloc?

The device memory allocation and deallocation is handled automatically when a data or compute region is entered or exited.

It’s a PGI run time routine that performs an optimized device to device data copy.

So does this involve two GPUs? Could you explain this in more detail?

Thank you,
Regards,
Minh