Pinned memory for asynchronous data transfer with OpenACC

We have been experimenting with the PGI compiler 17.10 (community edition) on NVIDIA K80 accelerators, trying to overlap data movement between host and device with the computation of OpenACC kernels. Our objective is to overlap a device-to-host data transfer with kernel computation. Broadly, the following OpenACC code should achieve this:

#include <openacc.h>   // for acc_malloc and acc_wait

const int size = 10000000;
double sum = 0.0;
double *y = (double*) acc_malloc( size * sizeof( double ) );   // device-only allocation
double *x = (double*) malloc( size * sizeof( double ) );       // host allocation
#pragma acc enter data create( x[0:size] )
#pragma acc enter data copyin( size )

#pragma acc parallel loop present( x, size ) deviceptr( y )
for( int i = 0; i < size; i++ )
{
   y[ i ] = i;
   x[ i ] = 2 * i;
}

const int stream1 = 1;
const int stream2 = 2;
#pragma acc update self( x[0:size] ) async( stream1 )
#pragma acc parallel loop present( size ) deviceptr( y ) reduction( +:sum ) async( stream2 )
for( int i = 0; i < size; i++ )
   sum += y[ i ];
acc_wait( stream2 );
acc_wait( stream1 );

When compiling this program with
pgc++ -acc -Minfo -o main main.cpp

and analysing it with nvprof, there is no visible overlap between these two tasks. However, when compiling with

pgc++ -ta=tesla:pinned -acc -Minfo -o main main.cpp

it is possible to observe the data movement overlapping with the kernel launch.
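For reference, the overlap (or its absence) can be checked from the start timestamps and durations in the GPU trace, for example with something like:

nvprof --print-gpu-trace ./main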

According to the PGI documentation, the compilation flag -ta=tesla:pinned makes all host memory allocations page-locked, which is the memory type used to transfer data between device and host.

Therefore, two questions arise:

  1. Is it necessary for host memory to be page-locked to achieve asynchronous data transfer and computation on the device with the PGI compiler?
  2. Is there an alternative way to allocate page-locked memory with coarse-grained control over the memory regions of interest with the PGI compiler (something equivalent to cudaHostAlloc from nvcc)? See the sketch below for the kind of control I have in mind.
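For illustration, this is the kind of control the CUDA runtime offers (a sketch only; it assumes the CUDA runtime header and library are available to pgc++, and whether the PGI OpenACC runtime then transfers directly from such an allocation is exactly what I would like to know):

#include <cuda_runtime.h>   // cudaMallocHost / cudaFreeHost (CUDA runtime API)

double *x = NULL;
cudaMallocHost( (void**)&x, size * sizeof( double ) );   // page-locked host allocation
#pragma acc enter data create( x[0:size] )               // device copy managed by OpenACC
// ... kernels and "update self( x[0:size] ) async(...)" as above ...
#pragma acc exit data delete( x[0:size] )
cudaFreeHost( x );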

Hi cconopoima,

Data transfers between the host and device must be done from pinned memory. Hence by default, the compiler’s runtime will use a pinned double buffering system where your data is first copied from virtual memory into the buffers, and then transferred to the device.

When using “pinned”, the host memory is allocated in pinned memory to begin with, so there are no buffers and no extra copy from virtual memory. The caveat is that physical memory is finite, and pinning is more of a request to the OS, so it is not guaranteed. Also, the cost of allocating pinned memory is higher, so it shouldn’t be used if there are frequent allocations and frees.

Most likely what’s happening here is that since your kernel is small, its execution time is shorter than the time to copy the data to the buffers. So async is working as expected, but the imbalance in execution times is hiding this when not using pinned.

Can you try rewriting your loop to make it run longer?
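For example, something along these lines (just a sketch with artificial extra work per element so the kernel time exceeds the buffered transfer time; the constants are arbitrary):

#pragma acc parallel loop present( size ) deviceptr( y ) reduction( +:sum ) async( stream2 )
for( int i = 0; i < size; i++ )
{
   double t = y[ i ];
   for( int k = 0; k < 100; k++ )      // artificial extra work per element
      t = t * 1.0000001 + 0.5;
   sum += t;
}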

Note that compute regions will block if data is returned from the device. Since you have a reduction, the compute region will need to block to wait for “sum” to be copied back. To solve this, put “sum” in a data region and manually manage its movement.
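Something like the following (a sketch only; the data clauses may need adjusting for your full code):

#pragma acc enter data copyin( sum )                     // keep a device copy of sum
#pragma acc parallel loop present( size ) deviceptr( y ) reduction( +:sum ) async( stream2 )
for( int i = 0; i < size; i++ )
   sum += y[ i ];
#pragma acc update self( sum ) async( stream2 )          // fetch the result back explicitly
acc_wait( stream2 );
#pragma acc exit data delete( sum )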

Hope this helps,
Mat

Hi Mat, thanks for your reply.

As a matter of fact, the kernel is small compared with the time the data movement takes when not using -ta=tesla:pinned. When visualizing in the PGI profiler, the total time taken by the data movement is far longer than the kernel time, and no overlap between these two tasks is visible. It seems as if the

 #pragma acc update self( x[0:size] ) async( stream1 )

blocks the OpenACC kernel launched right after it. Note that the memory region the update pragma acts on (x) is different from the memory region the OpenACC kernel acts on (indicated with deviceptr( y )), so in principle no dependency between these operations should be induced.

As I mentioned in the original post, when using -ta=tesla:pinned, the memory movement as displayed by pgprof takes less time and overlaps with the OpenACC kernel launched right after the update self directive.

Even though this technique with -ta=tesla:pinned improves the overall computation time, it is impractical for a project that makes heavy use of dynamic memory allocation. In the PGI documentation and the OpenACC standard there is nothing related to this besides the compilation flag. I’m wondering whether there exists an OpenACC directive that allows coarse-grained control over the creation of this pinned memory?

Not currently. We’ve had a few requests for more control over pinned memory, and I added your report to that request. FYI, it is FS#26655.

Hi brentl, thanks for this information and for adding me to the request. Just one more thing: could you please tell me where to find this request (FS#26655)?

Thanks again.
Cesar

Sorry, it is currently internal only. Once the bug is fixed or the feature is implemented, it will be documented in the corresponding release.