Pinned memory for asynchronous data transfer with OpenACC

We have been experimenting with the PGI compiler 17.10 (community edition) on NVIDIA K80 accelerators, trying to overlap data movement between host and device with the computation of OpenACC kernels. Our objective is to overlap a device-to-host data transfer with kernel computation. Broadly, the following OpenACC code should achieve this:

#include <openacc.h>   // for acc_malloc and acc_wait

const int size = 10000000;
double sum = 0.0;
double *y = (double*) acc_malloc( size * sizeof( double ) );   // device-only allocation
double *x = (double*) malloc( size * sizeof( double ) );       // host allocation
#pragma acc enter data create( x[0:size] )
#pragma acc enter data copyin( size )

#pragma acc parallel loop present( x, size ) deviceptr( y )
for( int i = 0; i < size; i++ )
{
   y[ i ] = i;
   x[ i ] = 2 * i;
}

const int stream1 = 1;
const int stream2 = 2;
#pragma acc update self( x[0:size] ) async( stream1 )
#pragma acc parallel loop present( size ) deviceptr( y ) reduction( +:sum ) async( stream2 )
for( int i = 0; i < size; i++ )
   sum += y[ i ];
acc_wait( stream2 );
acc_wait( stream1 );

When compiling this program with
pgc++ -acc -Minfo -o main main.cpp

and analysing it with nvprof, there is no visible overlap between these two tasks. However, when compiling with

pgc++ -ta=tesla:pinned -acc -Minfo -o main main.cpp

it is possible to observe the data movement overlapping with the kernel launch.
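For reference, the overlap (or its absence) can be checked from the start timestamps and durations in the GPU trace, for example with something like:

nvprof --print-gpu-trace ./main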

According to the PGI documentation, the compilation flag -ta=tesla:pinned makes all host memory allocations page-locked, which is the memory type used to transfer data between device and host.

Therefore, two questions arise:

  1. Is it necessary for host memory to be page-locked to achieve asynchronous data transfer and computation on the device with the PGI compiler?
  2. Is there an alternative way to allocate page-locked memory with coarse-grained control over the memory regions of interest with the PGI compiler (something equivalent to cudaHostAlloc from nvcc)? See the sketch below for the kind of control I have in mind.
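For illustration, this is the kind of control the CUDA runtime offers (a sketch only; it assumes the CUDA runtime header and library are available to pgc++, and whether the PGI OpenACC runtime then transfers directly from such an allocation is exactly what I would like to know):

#include <cuda_runtime.h>   // cudaMallocHost / cudaFreeHost (CUDA runtime API)

double *x = NULL;
cudaMallocHost( (void**)&x, size * sizeof( double ) );   // page-locked host allocation
#pragma acc enter data create( x[0:size] )               // device copy managed by OpenACC
// ... kernels and "update self( x[0:size] ) async(...)" as above ...
#pragma acc exit data delete( x[0:size] )
cudaFreeHost( x );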

Hi cconopoima,

Data transfers between the host and device must be done from pinned memory. Hence by default, the compiler’s runtime will use a pinned double buffering system where your data is first copied from virtual memory into the buffers, and then transferred to the device.

When using “pinned”, the host memory is allocated in pinned memory to begin with, so there are no buffers and no extra copy from virtual memory. The caveat is that physical memory is finite, and pinning is more of a request to the OS, so it is not guaranteed. Also, the cost of allocating pinned memory is higher, so it shouldn’t be used if there are frequent allocations and frees.

Most likely what’s happening here is that since your kernel is small, its execution time is shorter than the time to copy the data to the buffers. So async is working as expected, but the imbalance in execution times is hiding this when not using pinned.

Can you try rewriting your loop to make it run longer?
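For example, something along these lines (just a sketch with artificial extra work per element so the kernel time exceeds the buffered transfer time; the constants are arbitrary):

#pragma acc parallel loop present( size ) deviceptr( y ) reduction( +:sum ) async( stream2 )
for( int i = 0; i < size; i++ )
{
   double t = y[ i ];
   for( int k = 0; k < 100; k++ )      // artificial extra work per element
      t = t * 1.0000001 + 0.5;
   sum += t;
}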

Note that compute regions will block if data is returned from the device. Since you have a reduction, the compute region will need to block to wait for “sum” to be copied back. To solve this, put “sum” in a data region and manually manage its movement.
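Something like the following (a sketch only; the data clauses may need adjusting for your full code):

#pragma acc enter data copyin( sum )                     // keep a device copy of sum
#pragma acc parallel loop present( size ) deviceptr( y ) reduction( +:sum ) async( stream2 )
for( int i = 0; i < size; i++ )
   sum += y[ i ];
#pragma acc update self( sum ) async( stream2 )          // fetch the result back explicitly
acc_wait( stream2 );
#pragma acc exit data delete( sum )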

Hope this helps,
Mat

Hi Mat, thanks for your reply.

As a matter of fact, the kernel is small compared with the time the data movement takes when not using -ta=tesla:pinned. When visualizing in the PGI profiler, the total time taken by the data movement is far longer than the kernel time, and no overlap between these two tasks is visible. It seems as if the

 #pragma acc update self( x[0:size] ) async( stream1 )

blocks the OpenACC kernel launched right after it. Note that the memory region the update pragma acts on (x) is different from the memory region the OpenACC kernel acts on (indicated with deviceptr( y )), so in principle no dependency between these operations should be induced.

As I mentioned in the original post, when using -ta=tesla:pinned, the memory movement as displayed by pgprof takes less time and overlaps with the OpenACC kernel launched right after the update self directive.

Even though this technique with -ta=tesla:pinned improves the overall computation time, it is impractical for a project that makes heavy use of dynamic memory allocation. In the PGI documentation and the OpenACC standard there is nothing related to this besides the compilation flag. I’m wondering whether there exists an OpenACC directive that allows coarse-grained control over the creation of this pinned memory?

Not currently. We’ve had a few requests for more control over pinned memory, and I added your report to that request. FYI, it is FS#26655.

Hi brentl, thanks for this information and for adding me to the request. Just one more thing: could you please tell me where to find this request (FS#26655)?

Thanks again.
Cesar

Sorry, it is currently internal only. Once the bug is fixed or the feature is implemented, it will be documented in the corresponding release.