Hi,

I have code that solves a time-dependent differential equation, and the computation is accelerated on the GPU with OpenACC directives. The code looks like this:

double x[n]; //vector x with length n

double y[n]; //vector y with length n

//Create and copy x and y on GPU

#pragma acc enter data copyin(x[0:n])

#pragma acc enter data copyin(y[0:n])

//Iterations to do time-marching of the differential equation

for (step = 0; step < total_step; ++step)

{

//Some parallelized computation on GPU

#pragma acc parallel default(present)

{

#pragma acc loop gang vector

for (i = 0; i < n; i++) {

x[i] = x[i] + f(x,y); //Some computation on x

y[i] = y[i] + g(x,y); //Some computation on y

}

}

//I need to write x and y to file every 100 iterations

if (step % 100 == 0) {

#pragma acc update self(x[0:n], y[0:n]) // Copy back to host

#pragma acc wait //Wait until the copy finishes

for (i = 0; i < n; i++) {

fprintf(file_name, …, x[i], y[i]); //Write x and y to file

}

}

}

I noticed that, every 100 iterations, the computation of x and y on the GPU does not start until the file writing finishes. My thinking is that, because x and y have already been copied back to the host, the file writing on the CPU side and the GPU computation for the next iteration, or even the next 100 iterations, could run at the same time, which would be more efficient. But I don't know how to achieve this. Is there a way to do so? Should I create two threads on the host side?
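One possible direction that does not need a second host thread is sketched below, under several assumptions. The compute region is given an `async(1)` clause so the launch returns to the host immediately, and the file write for a snapshot is deferred until the kernels for the following steps have already been queued. The function name `march_deferred_write`, the helper `write_snapshot`, the host buffers `xs` and `ys`, the two-column output format, and the `+ 1.0` / `+ 2.0` updates standing in for f and g are all hypothetical. The extra `memcpy` keeps the snapshot separate from `x` and `y`, which also keeps the result correct if the code is built without OpenACC or host and device share memory.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write one snapshot; the two-column text format is an assumption. */
static void write_snapshot(FILE *out, const double *xs, const double *ys, int n)
{
    for (int i = 0; i < n; i++)
        fprintf(out, "%.6f %.6f\n", xs[i], ys[i]);
}

/* Returns the number of snapshots written, or -1 if allocation fails. */
int march_deferred_write(double *x, double *y, int n,
                         int total_step, int out_every, FILE *out)
{
    /* Host-only buffers holding the snapshot that still has to be written. */
    double *xs = malloc((size_t)n * sizeof *xs);
    double *ys = malloc((size_t)n * sizeof *ys);
    if (!xs || !ys) { free(xs); free(ys); return -1; }

    int written = 0;
    int pending = 0;   /* 1 while xs/ys hold a snapshot not yet on disk */

    #pragma acc enter data copyin(x[0:n], y[0:n])

    for (int step = 0; step < total_step; ++step) {
        /* async(1): the launch returns at once, so the host keeps queuing steps. */
        #pragma acc parallel loop gang vector default(present) async(1)
        for (int i = 0; i < n; i++) {
            x[i] = x[i] + 1.0;   /* placeholder for f(x, y) */
            y[i] = y[i] + 2.0;   /* placeholder for g(x, y) */
        }

        if (step % out_every == 0) {
            /* The steps queued since the previous snapshot may still be
               running on the GPU while the host writes that snapshot. */
            if (pending) { write_snapshot(out, xs, ys, n); written++; }

            /* Queued behind the kernels on queue 1, then waited for. */
            #pragma acc update self(x[0:n], y[0:n]) async(1)
            #pragma acc wait(1)
            memcpy(xs, x, (size_t)n * sizeof *xs);
            memcpy(ys, y, (size_t)n * sizeof *ys);
            pending = 1;
        }
    }

    if (pending) { write_snapshot(out, xs, ys, n); written++; }

    #pragma acc wait
    #pragma acc exit data copyout(x[0:n], y[0:n])
    free(xs);
    free(ys);
    return written;
}
```

With `out_every` set to 100, each snapshot is written while the next 100 steps are queued on the device. How many kernels a runtime will queue before the launch itself blocks is implementation-dependent, so the amount of overlap actually achieved should be measured rather than assumed.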

Another question: to make the code more efficient, is it possible to overlap the data copy from the GPU to the host with the GPU computation for the following iterations, without changing the values of x and y that the host should receive?
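A sketch of one way this might be done, assuming a second pair of arrays is acceptable: the current state is first copied into device-side snapshot buffers by a small kernel on queue 1, and those buffers are then transferred on queue 2 while queue 1 goes on computing. Because later steps only touch `x` and `y`, the host receives the values as they were at the snapshot step. The names `march_overlapped_copy`, `xs`, and `ys`, the output format, and the `+ 1.0` / `+ 2.0` placeholder updates are hypothetical. Whether the transfer really overlaps with the kernels may depend on the runtime and on whether the host memory is pinned.

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns the number of snapshots written, or -1 if allocation fails. */
int march_overlapped_copy(double *x, double *y, int n,
                          int total_step, int out_every, FILE *out)
{
    /* Snapshot buffers; they exist on both the host and the device. */
    double *xs = malloc((size_t)n * sizeof *xs);
    double *ys = malloc((size_t)n * sizeof *ys);
    if (!xs || !ys) { free(xs); free(ys); return -1; }

    int written = 0;
    int pending = 0;   /* 1 while a transfer on queue 2 awaits writing */

    #pragma acc enter data copyin(x[0:n], y[0:n]) create(xs[0:n], ys[0:n])

    for (int step = 0; step < total_step; ++step) {
        /* Time-marching work stays on queue 1. */
        #pragma acc parallel loop gang vector default(present) async(1)
        for (int i = 0; i < n; i++) {
            x[i] = x[i] + 1.0;   /* placeholder for f(x, y) */
            y[i] = y[i] + 2.0;   /* placeholder for g(x, y) */
        }

        if (step % out_every == 0) {
            if (pending) {
                /* Make sure the previous transfer has landed, then write it
                   while queue 1 is still busy with the steps queued so far. */
                #pragma acc wait(2)
                for (int i = 0; i < n; i++)
                    fprintf(out, "%.6f %.6f\n", xs[i], ys[i]);
                written++;
            }

            /* Freeze the current state in the device-side snapshot buffers. */
            #pragma acc parallel loop gang vector default(present) async(1)
            for (int i = 0; i < n; i++) {
                xs[i] = x[i];
                ys[i] = y[i];
            }

            /* Queue 2 starts only after the snapshot kernel on queue 1. */
            #pragma acc wait(1) async(2)
            /* The transfer runs on queue 2 while queue 1 keeps computing. */
            #pragma acc update self(xs[0:n], ys[0:n]) async(2)
            pending = 1;
        }
    }

    if (pending) {
        #pragma acc wait(2)
        for (int i = 0; i < n; i++)
            fprintf(out, "%.6f %.6f\n", xs[i], ys[i]);
        written++;
    }

    #pragma acc wait
    #pragma acc exit data copyout(x[0:n], y[0:n]) delete(xs[0:n], ys[0:n])
    free(xs);
    free(ys);
    return written;
}
```

The cost of this design is one extra device-to-device copy and twice the memory for `x` and `y`. The host-side `wait(2)` before each write also guarantees that the next snapshot kernel never overwrites `xs` and `ys` while an earlier transfer is still in flight.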

Thanks in advance.