Hi,
with the OpenACC async clauses, I can execute kernels and update data asynchronously. Now, I want to use different streams (=integer expressions).
- I have heard that PGI treats async(0), i.e. an integer argument of 0, as synchronous (only integer values > 0 are asynchronous). Is that true?
- With CUDA streams, it is said that if a call blocks, it blocks all other calls of the same type behind it (even in other streams) - where the call is either of type kernel or memcopy. I assume that is a hardware issue. So, is it also true for OpenACC?
Example code:
#pragma acc kernels async(1)
...
#pragma acc update async(1)
...
#pragma acc update async(2)
When these operations are issued, does that mean the second update can only execute after the first one? My reasoning: stream 1 runs first, since an operation was issued there first (and not in stream 2). So the kernel starts, followed by the first update. And since the second update has to wait until the first update has finished, it is executed last. In that case, everything is actually serialized.
Second example:
#pragma acc kernels async(1)
...
#pragma acc update async(1)
...
#pragma acc kernels async(2)
...
#pragma acc update async(2)
If I understand it right, here, only kernel execution in stream 2 and update in stream 1 could be really overlapped. Correct?
Bye, Sandra
Any news on the issue (for Nvidia GPUs)?
Hi Sandra,
Sorry for the late reply. I needed some clarification from engineering. Here’s their response:
In PGI 12.x, async(0) is interpreted as synchronous. This is in conflict with OpenACC V2.0, so PGI 13.x uses async(-1) to mean synchronous. Any other integer value is allowed, including negative values. The current OpenACC runtime only uses 8 CUDA streams, so the values are mapped down to between 0:7. The mapping is a little more complex than taking just the lower 3 bits, but values that differ in only the lower 3 bits will get different streams.
As for your second question, the OpenACC 1.0 Spec states:
Two asynchronous activities with the same argument value will be executed on the device in the order they are encountered by the host process. Two asynchronous activities with different handle values may be executed on the device in any order relative to each other. If there are two or more host threads executing and sharing the same accelerator device, two asynchronous activities with the same argument value will execute on the device one after the other, though the relative order is not determined.
One thing to add is that “async” is asynchronous to the host code and doesn’t block until a “wait” directive is encountered. Hence, in the case of:
#pragma acc kernels async(1)
...
#pragma acc update async(1)
...
#pragma acc update async(2)
The second update could be executed before the first.
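If the update on queue 2 must not overtake the one on queue 1, one option is a host-side wait on queue 1 before issuing it. The following is only a minimal sketch under assumptions: the function name `ordered_updates` and the arrays `a` and `b` are hypothetical and not from the thread, and `parallel loop` is used in place of `kernels` to keep the example short. With a compiler that ignores the pragmas, it runs as plain host code and gives the same result.

```c
/* Hypothetical helper, not from the thread: increments a[] on the device,
   then copies a[] and b[] back, forcing the b[] update to come second. */
void ordered_updates(float *a, float *b, int n)
{
    #pragma acc data copyin(a[0:n], b[0:n])
    {
        /* Kernel on queue 1; the host continues immediately. */
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            a[i] += 1.0f;

        /* Same queue as the kernel, so it is ordered after the kernel. */
        #pragma acc update host(a[0:n]) async(1)

        /* Host blocks here until everything queued on 1 has finished. */
        #pragma acc wait(1)

        /* Issued only after queue 1 has drained, so it cannot overtake it. */
        #pragma acc update host(b[0:n]) async(2)

        /* Make sure queue 2 is done before leaving the data region. */
        #pragma acc wait(2)
    }
}
```

The price of the `wait(1)` is that the host stalls at that point, so it removes any overlap between the two queues. Without it, the two updates are unordered relative to each other.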
For the second example, the only guarantee is that stream 1’s kernel will launch before stream 1’s update, and stream 2’s kernel will launch before stream 2’s update.
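The second example could look roughly like the sketch below, with a final `wait` so the host only reads the results once both queues have finished. This is an illustration under assumptions, not code from the thread: `scale_two_queues`, `a`, `b`, and `factor` are hypothetical names, and whether the two queues actually overlap on the device depends on the runtime and hardware.

```c
/* Hypothetical helper, not from the thread: scales a[] on queue 1 and
   b[] on queue 2, then waits for both queues before returning. */
void scale_two_queues(float *a, float *b, int n, float factor)
{
    #pragma acc data copyin(a[0:n], b[0:n])
    {
        /* Queue 1: the kernel is ordered before its update. */
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            a[i] *= factor;
        #pragma acc update host(a[0:n]) async(1)

        /* Queue 2: the kernel is ordered before its update, but nothing
           here is ordered relative to the work on queue 1. */
        #pragma acc parallel loop async(2)
        for (int i = 0; i < n; ++i)
            b[i] *= factor;
        #pragma acc update host(b[0:n]) async(2)

        /* Without an argument, wait blocks the host until all queues
           have finished; only then are a[] and b[] safe to read. */
        #pragma acc wait
    }
}
```

Reading `a` or `b` on the host before the `wait` would be a race, since the async updates may not have completed yet.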
Hope this helps,
Mat