Async usage seems to be blocking

Hi,

I’ve been experimenting with the async directive in OpenACC.
The goal of the following code is to asynchronously initialize new arrays on the host while performing an arbitrary calculation on the device using a previously initialized array.
I use acc_async_test immediately following the parallel region. In theory this call should return 0 since control is passed back to the host before the parallel region finishes.

However, when I run my code, control doesn’t return to the host until after the parallel region finishes. In the output where I print the result of acc_async_test, the value is always 1.

Is there something I am misunderstanding about how to correctly use async?

I am compiling with

pgc++ source.cpp -acc -Minfo=all -ta=tesla:cc35

Here is my example code

#define L (10) 
#define M (5000) 
#define N (5000) 

#define BUSY (1000) 

#include <iostream> 
#include <stdlib.h> 
#include <openacc.h> 

static int busy_loops = BUSY; 

static float variance = 8192.0; 

void init(float (*A)[N]) 
{ 
		int i,j; 

		for (j = 0; j < M; j++) 
		{ 
				for (i = 0; i < N; i++) 
				{ 
						A[j][i] = (float)rand()/(float)(RAND_MAX/variance); 
				} 
		} 
} 


int main() 
{ 
		int i,j,k,spin,status; 

		static float S[M][N]; 
		static float A[L][M][N]; 
		static float B[L][M][N]; 

		init(S); 

		#pragma acc data copyin(S[0:M][0:N]), create(B[0:L][0:M][0:N]) async(0)
		for (k = 0; k < L; k++) 
		{ 
				init(A[k]); 

				std::cout << "Array " << k << " initialized" << std::endl; 

				#pragma acc data copyin(A[k:1][0:M][0:N]) async(k) 
				#pragma acc parallel async(k) 
				{ 
						#pragma acc loop gang 
						for (j = 0; j < M; j++) 
						{ 
								#pragma acc loop vector 
								for (i = 0; i < N; i++) 
								{ 
										// here I tried to prolong the time spent on the device 
										for (spin = 0; spin < busy_loops; spin++) 
										{ 
												B[k][j][i] = (S[j][i] * A[k][j][i])/variance; 
										} 
								} 
						} 
				} 
				#pragma acc update self(B[k:1][0:M][0:N]) async(k) 

				status = acc_async_test(k); 

				std::cout << "Next array... async value for " 
						      << k << " = " << status << std::endl; 
		} 

		return 0; 
}

Hi dehsu,

When “acc_async_test” returns true (i.e. a non-zero value such as 1), it means that all asynchronous operations on the given queue have completed. This is the case here. For various implementation reasons (mostly the lack of a callback mechanism from the OS indicating when a transfer is finished), we have to block after an update self directive. The recommendation would be to move the update self directive to its own loop.
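
As a rough illustration of that return value, here is a minimal sketch of a host-side polling loop that keeps doing host work while a queue still has pending operations. The helper name `overlap_until_done` and its `host_work` and `max_polls` parameters are hypothetical, and the fallback stub for `acc_async_test` is only an assumption so the sketch also builds without an OpenACC compiler; with a real OpenACC compiler the runtime routine from `openacc.h` is used.

```cpp
#include <cstddef>

#ifdef _OPENACC
#include <openacc.h>
#else
// Assumption for illustration only: without OpenACC everything runs
// synchronously, so a queue is always reported as complete (non-zero).
static int acc_async_test(int) { return 1; }
#endif

// Hypothetical helper: call host_work() repeatedly while the given async
// queue still has pending operations (acc_async_test returns 0), giving up
// after max_polls calls. Returns how many times host_work() was called.
template <typename HostWork>
std::size_t overlap_until_done(int queue, HostWork host_work,
                               std::size_t max_polls)
{
      std::size_t polls = 0;
      while (polls < max_polls && acc_async_test(queue) == 0)
      {
            host_work();   // host stays busy while the device works
            ++polls;
      }
      return polls;
}
```

If this returns 0 right after launching work on the queue, the queue had already drained (or the launch blocked), which matches the "always 1" behaviour described in the question.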

I’d also recommend moving the creation of “A” to the outer data region and then using an update directive to synchronize the values. This will create a single block of memory rather than many smaller blocks. Also, when creating a sub-array, “A[k:1]…”, in order to preserve proper indexing on the device, the runtime must create all preceding elements, “A[0:k]…”, even if they aren’t going to be accessed. Hence, you’ll be wasting a lot of memory creating many copies of A on the device.
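
To put rough numbers on the sub-array point, here is a small sketch of the arithmetic. It assumes a 4-byte float and reads the behaviour described above as "mapping A[k:1] allocates the k preceding slabs plus slab k itself". The function names are hypothetical and this only estimates sizes; it does not query the runtime.

```cpp
#include <cstddef>

// Bytes in one [m][n] slab of floats.
constexpr std::size_t slab_bytes(std::size_t m, std::size_t n)
{
      return m * n * sizeof(float);
}

// Assumed footprint of mapping A[k:1][0:m][0:n]: the k preceding slabs are
// allocated as well, so k + 1 slabs in total.
constexpr std::size_t subarray_bytes(std::size_t k, std::size_t m,
                                     std::size_t n)
{
      return (k + 1) * slab_bytes(m, n);
}

// Footprint of mapping the whole array A[0:l][0:m][0:n] once.
constexpr std::size_t whole_array_bytes(std::size_t l, std::size_t m,
                                        std::size_t n)
{
      return l * slab_bytes(m, n);
}
```

With M = N = 5000, one slab is 100,000,000 bytes, so under this assumption the per-iteration allocation grows from one slab at k = 0 to the size of the whole array at k = L - 1, and is redone on every iteration instead of once up front.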

% cat async.cpp
#define L (10)
#define M (5000)
#define N (5000)

#define BUSY (1000)

#include <iostream>
#include <stdlib.h>
#include <openacc.h>

static int busy_loops = BUSY;

static float variance = 8192.0;

void init(float (*A)[N])
{
      int i,j;

      for (j = 0; j < M; j++)
      {
            for (i = 0; i < N; i++)
            {
                  A[j][i] = (float)rand()/(float)(RAND_MAX/variance);
            }
      }
}


int main()
{
      int i,j,k,spin,status;

      static float S[M][N];
      static float A[L][M][N];
      static float B[L][M][N];

      init(S);

      #pragma acc enter data copyin(S[0:M][0:N]), create(B[0:L][0:M][0:N]) create(A[0:L][0:M][0:N])
      for (k = 0; k < L; k++)
      {
            init(A[k]);
            std::cout << "Array " << k << " initialized" << std::endl;
            #pragma acc update device (A[k:1][0:M][0:N]) async(k+1)
            #pragma acc parallel present(A,B,S) async(k+1)
            {
                  #pragma acc loop gang
                  for (j = 0; j < M; j++)
                  {
                        #pragma acc loop vector
                        for (i = 0; i < N; i++)
                        {
                              // here I tried to prolong the time spent on the device
                              for (spin = 0; spin < busy_loops; spin++)
                              {
                                    B[k][j][i] = (S[j][i] * A[k][j][i])/variance;
                              }
                        }
                  }
            }
            status = acc_async_test(k+1);
            std::cout << "Next array... async value for "
                        << k << " = " << status << std::endl;
      }
      for (k = 0; k < L; k++)
      {
            #pragma acc update self(B[k:1][0:M][0:N]) async(k+1)

      }
      #pragma acc wait
      #pragma acc exit data delete(S[0:M][0:N],B[0:L][0:M][0:N],A[0:L][0:M][0:N])
      std::cout << "B[1][1][1] = " << B[1][1][1] << std::endl;
      return 0;
}
% pgc++ -acc -Minfo=accel async.cpp; a.out
main:
     40, Generating enter data create(A[:][:][:],B[:][:][:])
         Generating enter data copyin(S[:][:])
     46, Generating update device(A[k][:][:])
         Generating present(A[:][:][:],S[:][:],B[:][:][:])
         Accelerator kernel generated
         Generating Tesla code
         48, #pragma acc loop gang /* blockIdx.x */
         51, #pragma acc loop vector(128) /* threadIdx.x */
         54, #pragma acc loop seq
     51, Loop is parallelizable
     54, Loop carried reuse of B prevents parallelization
     65, Generating update self(B[k][:][:])
     72, Generating exit data delete(A[:][:][:],S[:][:],B[:][:][:])
Array 0 initialized
Next array... async value for 0 = 0
Array 1 initialized
Next array... async value for 1 = 0
Array 2 initialized
Next array... async value for 2 = 0
Array 3 initialized
Next array... async value for 3 = 0
Array 4 initialized
Next array... async value for 4 = 0
Array 5 initialized
Next array... async value for 5 = 0
Array 6 initialized
Next array... async value for 6 = 0
Array 7 initialized
Next array... async value for 7 = 0
Array 8 initialized
Next array... async value for 8 = 0
Array 9 initialized
Next array... async value for 9 = 0
B[1][1][1] = 76.0432

Hope this helps,
Mat

Ok that makes a lot of sense. Thanks for the reply and help Mat!