Simultaneous computation on GPU, data copy and file writing

Hi,

I have a code for solving a time-dependent differential equation, and the computation is accelerated on the GPU with OpenACC directives. The code looks like this:

double x[n]; // vector x with length n
double y[n]; // vector y with length n

// Create and copy x and y on GPU
#pragma acc enter data copyin(x[0:n])
#pragma acc enter data copyin(y[0:n])

// Iterations to do time-marching of the differential equation
for (step = 0; step < total_step; ++step)
{
    // Some parallelized computation on GPU
    #pragma acc parallel default(present)
    {
        #pragma acc loop gang vector
        for (i = 0; i < n; i++) {
            x[i] = x[i] + f(x, y); // Some computation on x
            y[i] = y[i] + g(x, y); // Some computation on y
        }
    }

    // I need to write x and y to file every 100 iterations
    if (step % 100 == 0) {
        #pragma acc update self(x[0:n], y[0:n]) // Copy back to host
        #pragma acc wait // Wait until the copying finishes
        for (i = 0; i < n; i++) {
            fprintf(file_name, …, x[i], y[i]); // Write x and y to file
        }
    }
}

I noticed that, every 100 iterations, the computation of x and y on the GPU does not start until the file writing finishes. My thought is that, because x and y have already been copied back to the host, the file writing on the CPU side and the GPU computation of x and y for the next (or even the next 100) iterations could happen at the same time, which would be more efficient. But I don't know how to realize this. Is there a way to do so? Should I create two threads on the host side?

And another question: to make the code more efficient, is it possible for the data copy from the GPU to the host and the GPU computation for the next iterations to happen at the same time, without changing the x and y that the host should receive?

Thanks in advance.

Is it possible for the data copy from the GPU to the host and the GPU computation for the next iterations to happen at the same time, without changing the x and y that the host should receive?

Sure. OpenACC has the “async” clause, which allows copies to be performed asynchronously with respect to the host. For details, see Jeff’s talk https://on-demand.gputechconf.com/gtc/2015/presentation/S5195-Jeff-Larkin.pdf starting at slide 38.

Since the kernels depend on “x” and “y” being fully copied before computing, the compute kernel will need to be on the same async queue to create the dependency between the copy and compute, or you’ll want to block your data to remove this dependency.

My thought is, because the x and y have been copied back to host, the file writing on the CPU side and the computation of x and y on GPU for the next or even next 100 iterations can do at the same time, which is more efficient. But I don’t know how to realize it. Is there a way to do so? Should I create two threads on host side?

The dependency here is purely on the host, i.e. the host thread can’t continue until it prints all the data to a file. Forking off a new thread to perform the file write might work, but I’d probably use temp arrays so you don’t have a race condition where “x” and “y” are being updated at the same time they are being written to the file.

-Mat


Hi Mat,

Thank you very much for your reply!

I did not fully understand why using temp arrays lets the host thread continue before it has printed all the data to a file. Could you please explain more about this, or give some references for details? Thanks!

Best,
Shihao

What allows the main thread to continue is the forking of a new thread via pthreads or C++ std::thread.

The temp arrays ensure that, once the new host thread is forked, the data being printed is not overwritten by the main thread.

Got it. Thank you, Mat!
