I’ve been experimenting with the async directive in OpenACC.
The goal of the following code is to asynchronously initialize new arrays on the host while performing an arbitrary calculation on the device using a previously initialized array.
I use acc_async_test immediately following the parallel region. In theory this call should return 0 since control is passed back to the host before the parallel region finishes.
However, when I run my code, control doesn’t return to the host until after the parallel region finishes. In the output where I print the result of acc_async_test, the value is always 1.
Is there something I am misunderstanding about how to correctly use async?
I am compiling with
pgc++ source.cpp -acc -Minfo=all -ta=tesla:cc35
Here is my example code
#define L (10)
#define M (5000)
#define N (5000)
#define BUSY (1000)
#include <iostream>
#include <stdlib.h>
#include <openacc.h>
static int busy_loops = BUSY;
static float variance = 8192.0;
void init(float (*A)[N])
{
int i,j;
for (j = 0; j < M; j++)
{
for (i = 0; i < N; i++)
{
A[j][i] = (float)rand()/(float)(RAND_MAX/variance);
}
}
}
int main()
{
int i,j,k,spin,status;
static float S[M][N];
static float A[L][M][N];
static float B[L][M][N];
init(S);
#pragma acc data copyin(S[0:M][0:N]), create(B[0:L][0:M][0:N]) async(0)
for (k = 0; k < L; k++)
{
init(A[k]);
std::cout << "Array " << k << " initialized" << std::endl;
#pragma acc data copyin(A[k:1][0:M][0:N]) async(k)
#pragma acc parallel async(k)
{
#pragma acc loop gang
for (j = 0; j < M; j++)
{
#pragma acc loop vector
for (i = 0; i < N; i++)
{
// here I tried to prolong the time spent on the device
for (spin = 0; spin < busy_loops; spin++)
{
B[k][j][i] = (S[j][i] * A[k][j][i])/variance;
}
}
}
}
#pragma acc update self(B[k:1][0:M][0:N]) async(k)
status = acc_async_test(k);
std::cout << "Next array... async value for "
<< k << " = " << status << std::endl;
}
return 0;
}
When “acc_async_test” returns true (i.e. a non-zero value such as 1), it means that all asynchronous operations on the given queue have completed. That is the case here. For various implementation reasons (mostly the lack of a callback mechanism from the OS indicating when a transfer is finished), we have to block after an update self directive. Since the update self in your loop blocks, queue k has already drained by the time acc_async_test is called, so it returns 1. The recommendation would be to move the update self directive to its own loop.
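As a rough sketch of the return-value semantics described above (this is not the poster's code; `queue_done` and `scale_async` are made-up names for illustration), the snippet below launches one kernel on a queue, waits on that queue, and only then tests it. It assumes an OpenACC compiler such as pgc++ with -acc; when built without OpenACC the pragmas are ignored and the loop simply runs on the host.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENACC
#include <openacc.h>
#endif

// Hypothetical helper: true once queue q has no pending operations.
// Without OpenACC all work runs synchronously on the host, so report "done".
static bool queue_done(int q)
{
#ifdef _OPENACC
    return acc_async_test(q) != 0;   // non-zero means the queue is empty
#else
    (void)q;
    return true;
#endif
}

// Hypothetical example routine: scales v on the device using queue q and
// returns the queue status observed after waiting on that queue.
bool scale_async(std::vector<float>& v, float factor, int q)
{
    assert(q >= 0);                  // this sketch expects an ordinary queue id
    float* p = v.data();
    const int n = static_cast<int>(v.size());

    #pragma acc enter data copyin(p[0:n])

    #pragma acc parallel loop present(p[0:n]) async(q)
    for (int i = 0; i < n; i++)
        p[i] *= factor;

    // Independent host work could go here; the kernel may still be running.

    #pragma acc wait(q)
    bool done = queue_done(q);       // queue q is empty after the wait

    #pragma acc update self(p[0:n])
    #pragma acc exit data delete(p[0:n])
    return done;
}
```

Calling `queue_done(q)` right after the parallel loop instead would normally give false while the kernel is still running, though that is not guaranteed for a kernel this small.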
I’d also recommend moving the creation of “A” to the outer data region and then using an update directive to synchronize the values. This will create a single block of memory rather than many smaller blocks. Also, when creating a sub-array, “A[k:1]…”, in order to preserve proper indexing on the device, the runtime must create all preceding elements, “A[0:k]…”, even if they aren’t going to be accessed. Hence, you’ll be wasting a lot of memory creating many copies of A on the device.
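To put rough numbers on that, here is a small back-of-the-envelope sketch using the array sizes from the posted code. The names `SLABS`, `ROWS`, `COLS` and the three functions are made up for illustration, and `subarray_bytes` simply takes the behavior described above (mapping slab k also allocates slabs 0 through k-1) as an assumption rather than measuring it.

```cpp
#include <cassert>
#include <cstddef>

// Dimensions from the posted code (L, M and N there).
constexpr std::size_t SLABS = 10;
constexpr std::size_t ROWS  = 5000;
constexpr std::size_t COLS  = 5000;

// Bytes in one [M][N] slab of floats.
constexpr std::size_t slab_bytes()
{
    return ROWS * COLS * sizeof(float);
}

// Assumption taken from the explanation above: mapping A[k:1] makes the
// runtime allocate slabs 0..k, i.e. k+1 slabs in total.
constexpr std::size_t subarray_bytes(std::size_t k)
{
    return (k + 1) * slab_bytes();
}

// Bytes for a single allocation of the whole array in the outer data region.
constexpr std::size_t whole_array_bytes()
{
    return SLABS * slab_bytes();
}
```

With 4-byte floats one slab is about 100 MB, so under that assumption the mapping for the last slab alone would be as large as the whole array (about 1 GB).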
% cat async.cpp
#define L (10)
#define M (5000)
#define N (5000)
#define BUSY (1000)
#include <iostream>
#include <stdlib.h>
#include <openacc.h>
static int busy_loops = BUSY;
static float variance = 8192.0;
void init(float (*A)[N])
{
int i,j;
for (j = 0; j < M; j++)
{
for (i = 0; i < N; i++)
{
A[j][i] = (float)rand()/(float)(RAND_MAX/variance);
}
}
}
int main()
{
int i,j,k,spin,status;
static float S[M][N];
static float A[L][M][N];
static float B[L][M][N];
init(S);
#pragma acc enter data copyin(S[0:M][0:N]), create(B[0:L][0:M][0:N]) create(A[0:L][0:M][0:N])
for (k = 0; k < L; k++)
{
init(A[k]);
std::cout << "Array " << k << " initialized" << std::endl;
#pragma acc update device (A[k:1][0:M][0:N]) async(k+1)
#pragma acc parallel present(A,B,S) async(k+1)
{
#pragma acc loop gang
for (j = 0; j < M; j++)
{
#pragma acc loop vector
for (i = 0; i < N; i++)
{
// here I tried to prolong the time spent on the device
for (spin = 0; spin < busy_loops; spin++)
{
B[k][j][i] = (S[j][i] * A[k][j][i])/variance;
}
}
}
}
status = acc_async_test(k+1);
std::cout << "Next array... async value for "
<< k << " = " << status << std::endl;
}
for (k = 0; k < L; k++)
{
#pragma acc update self(B[k:1][0:M][0:N]) async(k+1)
}
#pragma acc wait
#pragma acc exit data delete(S[0:M][0:N],B[0:L][0:M][0:N],A[0:L][0:M][0:N])
std::cout << "B[1][1][1] = " << B[1][1][1] << std::endl;
return 0;
}
% pgc++ -acc -Minfo=accel async.cpp; a.out
main:
40, Generating enter data create(A[:][:][:],B[:][:][:])
Generating enter data copyin(S[:][:])
46, Generating update device(A[k][:][:])
Generating present(A[:][:][:],S[:][:],B[:][:][:])
Accelerator kernel generated
Generating Tesla code
48, #pragma acc loop gang /* blockIdx.x */
51, #pragma acc loop vector(128) /* threadIdx.x */
54, #pragma acc loop seq
51, Loop is parallelizable
54, Loop carried reuse of B prevents parallelization
65, Generating update self(B[k][:][:])
72, Generating exit data delete(A[:][:][:],S[:][:],B[:][:][:])
Array 0 initialized
Next array... async value for 0 = 0
Array 1 initialized
Next array... async value for 1 = 0
Array 2 initialized
Next array... async value for 2 = 0
Array 3 initialized
Next array... async value for 3 = 0
Array 4 initialized
Next array... async value for 4 = 0
Array 5 initialized
Next array... async value for 5 = 0
Array 6 initialized
Next array... async value for 6 = 0
Array 7 initialized
Next array... async value for 7 = 0
Array 8 initialized
Next array... async value for 8 = 0
Array 9 initialized
Next array... async value for 9 = 0
B[1][1][1] = 76.0432