Simultaneous computation on GPU, data copy and file writing

Hi,

I have a code for solving a time-dependent differential equation, and the computation is accelerated on the GPU with OpenACC directives. The code looks like this:

double x[n]; // vector x with length n
double y[n]; // vector y with length n

// Create and copy x and y on GPU
#pragma acc enter data copyin(x[0:n])
#pragma acc enter data copyin(y[0:n])

// Iterations to do time-marching of the differential equation
for (step = 0; step < total_step; ++step)
{
    // Some parallelized computation on GPU
    #pragma acc parallel default(present)
    {
        #pragma acc loop gang vector
        for (i = 0; i < n; i++) {
            x[i] = x[i] + f(x, y); // Some computation on x
            y[i] = y[i] + g(x, y); // Some computation on y
        }
    }

    // I need to write x and y to file every 100 iterations
    if (step % 100 == 0) {
        #pragma acc update self(x[0:n], y[0:n]) // Copy back to host
        #pragma acc wait // Wait until the copying finishes
        for (i = 0; i < n; i++) {
            fprintf(file_name, …, x[i], y[i]); // Write x and y to file
        }
    }
}

I noticed that, every 100 iterations, the computation of x and y on the GPU does not start until the file writing finishes. My thought is that, because x and y have already been copied back to the host, the file writing on the CPU side and the GPU computation of x and y for the next (or even the next 100) iterations could happen at the same time, which would be more efficient. But I don't know how to realize this. Is there a way to do so? Should I create two threads on the host side?

And another question: to make the code more efficient, is it possible for the data copy from the GPU to the host and the GPU computation for the next iterations to happen at the same time, without changing the x and y that the host should receive?

Thanks in advance.

Is it possible for the data copy from the GPU to the host and the GPU computation for the next iterations to happen at the same time, without changing the x and y that the host should receive?

Sure. OpenACC has the “async” clause, which allows copies to be performed asynchronously with respect to the host. For details, see Jeff’s talk https://on-demand.gputechconf.com/gtc/2015/presentation/S5195-Jeff-Larkin.pdf starting at slide 38.

Since the kernels depend on “x” and “y” being fully copied before computing, the compute kernel will need to be on the same async queue to create the dependency between the copy and compute, or you’ll want to block your data to remove this dependency.

My thought is, because the x and y have been copied back to host, the file writing on the CPU side and the computation of x and y on GPU for the next or even next 100 iterations can do at the same time, which is more efficient. But I don’t know how to realize it. Is there a way to do so? Should I create two threads on host side?

The dependency here is purely on the host, i.e. the host thread can’t continue until it prints all the data to a file. Forking off a new thread to perform the file write might work, but I’d probably use temp arrays so you don’t have a race condition where “x” and “y” are being updated at the same time they are being written to the file.

-Mat


Hi Mat,

Thank you very much for your reply!

I did not fully understand why using temp arrays lets the host thread continue before it has printed all the data to a file. Could you please explain more about this, or give some references for details? Thanks!

Best,
Shihao

What allows the main thread to continue is the forking of a new thread via pthreads or C++ std::thread.

The temp arrays ensure that, once the new host thread is forked, the data being printed is not overwritten by the main thread.

Got it. Thank you, Mat!
