Overlapping kernel and data execution

sWienke · March 1, 2013, 10:19am

Hi,
With the OpenACC async clauses, I can execute kernels and update data asynchronously. Now I want to use different streams (= integer expressions).

I have heard that PGI interprets an argument of 0 to the async clause as synchronous behavior (only integer values > 0 are asynchronous). Is that true?
With CUDA streams, it is said that if a call blocks, it blocks all other calls of the same type behind it (even in other streams), where a call is either of type kernel or memcopy. I assume that is a hardware issue. So, is it also true for OpenACC?
Example code:

#pragma acc kernels async(1)
...
#pragma acc update async(1)
...
#pragma acc update async(2)

When these operations are issued, does that mean the second update can only execute after the first update? The reasoning would be: stream 1 is executed first, since an operation was issued there first (and not in stream 2). So the kernel execution starts, followed by the first update. And since the second update has to wait until the first update has finished, it is executed last. In that case, everything is actually serialized.
Second example:

#pragma acc kernels async(1)
...
#pragma acc update async(1)
...
#pragma acc kernels async(2)
...
#pragma acc update async(2)

If I understand correctly, only the kernel execution in stream 2 and the update in stream 1 could really overlap here. Correct?
Bye, Sandra

sWienke · March 5, 2013, 10:08am

Any news on the issue (for Nvidia GPUs)?

MatColgrove · March 8, 2013, 10:28pm

Hi Sandra,

Sorry for the late reply. I needed some clarification from engineering. Here’s their response:

In PGI 12.x, async(0) is interpreted as synchronous. This conflicts with OpenACC V2.0, so PGI 13.x uses async(-1) to mean synchronous. Any other integer value is allowed, including negative values. The current OpenACC runtime only uses 8 CUDA streams, so the values are mapped down to the range 0:7. The mapping is a little more complex than taking just the lower 3 bits, but values that differ in only the lower 3 bits will get different streams.

As for your second question, the OpenACC 1.0 Spec states:

Two asynchronous activities with the same argument value will be executed on the device in the order they are encountered by the host process. Two asynchronous activities with different handle values may be executed on the device in any order relative to each other. If there are two or more host threads executing and sharing the same accelerator device, two asynchronous activities with the same argument value will execute on the device one after the other, though the relative order is not determined.

One thing to add is that “async” is asynchronous to the host code and doesn’t block until a “wait” directive is encountered. Hence, in the case of:

#pragma acc kernels async(1)
...
#pragma acc update async(1)
...
#pragma acc update async(2)

The second update could be executed before the first.
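To make that concrete, here is a minimal sketch of the first example filled in with data clauses and an explicit wait. The function name `scale_and_fetch`, the arrays `a` and `b`, and the loop body are assumptions made up for illustration; only the async/update/wait pattern comes from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): doubles a[] on the
   device in queue 1, copies it back in queue 1, and copies an unrelated
   array b[] back in queue 2. */
void scale_and_fetch(float *a, float *b, size_t n)
{
    assert(n > 0);

    #pragma acc data copyin(a[0:n], b[0:n])
    {
        /* Queue 1: kernel is enqueued; the host continues immediately. */
        #pragma acc kernels async(1)
        for (size_t i = 0; i < n; ++i)
            a[i] *= 2.0f;

        /* Same queue as the kernel, so it runs after the kernel. */
        #pragma acc update host(a[0:n]) async(1)

        /* Different queue: no ordering relative to queue 1 is guaranteed,
           so this copy may happen before, during, or after the work above. */
        #pragma acc update host(b[0:n]) async(2)

        /* The host blocks here until all queues have drained. */
        #pragma acc wait
    }
}
```

Because `b` does not depend on the kernel, it is safe for its update to run at any time. If the queue 2 update had to see the kernel's results, it would need to go into queue 1 or be preceded by a wait on queue 1.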

For the second example, the only guarantee is that stream 1's kernel will launch before stream 1's update, and stream 2's kernel will launch before stream 2's update.
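A sketch of the second example under the same caveats: `process_two_queues`, the arrays `x` and `y`, and the arithmetic are hypothetical, chosen only so that the two queues touch independent data. Whether any overlap actually happens is up to the runtime and the hardware; the code relies only on the in-queue ordering.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): two independent
   kernel + update pairs, one per async queue. */
void process_two_queues(float *x, float *y, size_t n)
{
    assert(n > 0);

    #pragma acc data copyin(x[0:n], y[0:n])
    {
        /* Queue 1: kernel, then its update, in that order. */
        #pragma acc kernels async(1)
        for (size_t i = 0; i < n; ++i)
            x[i] += 1.0f;
        #pragma acc update host(x[0:n]) async(1)

        /* Queue 2: kernel, then its update, in that order. Nothing is
           promised about ordering relative to queue 1. */
        #pragma acc kernels async(2)
        for (size_t i = 0; i < n; ++i)
            y[i] *= 3.0f;
        #pragma acc update host(y[0:n]) async(2)

        /* Wait on each queue separately; after wait(1) the host copy of
           x is valid even if queue 2 is still busy. */
        #pragma acc wait(1)
        #pragma acc wait(2)
    }
}
```

Waiting per queue rather than with a bare `wait` lets the host start consuming `x` as soon as queue 1 has finished.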

Hope this helps,
Mat
