How do I get the host to write output while kernels execute on the device?

I am using pthreads to execute the same kernel on two C870s, and have assigned one of the threads to write the results to file every timestep. I am using semaphores as a barrier for this, blocking both threads from proceeding until the results have been written to file on the host, but I suspect there is a more elegant solution.

At the moment the code looks like this:

[codebox]//each thread performs calculations for timestep i
//each thread cuMemcpys its results from device memory to host memory
//semaphores block each thread until all results have been gathered on the host
//thread 0 writes the results to file
//semaphores release both threads again
//calculations for the next timestep i+1[/codebox]
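In a bit more detail, the body of the per-timestep loop is roughly the following (a sketch; the helper names stand in for my real code):

[codebox]sem_t results_ready;     // posted by thread 1 once its results are in host memory
sem_t results_written;   // posted by thread 0 after the file write

// body of the per-timestep loop in each thread (tid is 0 or 1)
run_kernel_for_timestep(tid, i);   // kernel for timestep i on this thread's C870
copy_results_to_host(tid, i);      // cuMemcpyDtoH of this device's results

if (tid == 0) {
    sem_wait(&results_ready);      // wait until thread 1's results are on the host as well
    write_results_to_file(i);      // thread 0 writes everything for timestep i
    sem_post(&results_written);    // release thread 1
} else {
    sem_post(&results_ready);
    sem_wait(&results_written);    // block until the file write has finished
}
// only now does either thread start timestep i+1[/codebox]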

Is there a way to have the host write the results from timestep i to file at the same time that the GPUs are executing the kernels for timestep i+1, rather than holding back the kernels for timestep i+1 until the host has written the results from timestep i?

Just a guess, but what about a third thread that writes the data from host memory to file, with the two executing threads blocked from copying the data of the next step back to the host until that write is done?

A run of this would look something like the schedule below (a rough pthreads sketch follows it):

Thread1                   Thread2                   Thread3

launch kernel for i       launch kernel for i
wait for data written     wait for data written
copy data back to host    copy data back to host
sync threads              sync threads
launch thread 3
launch kernel for i+1     launch kernel for i+1     write data from memory to file
wait for data written     wait for data written
                                                    signal data written to disk
copy data back to host    copy data back to host
...
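In pthreads that could be expressed roughly like this (just a sketch, with both barriers initialised for 3 threads; the helper names are made up):

[codebox]pthread_barrier_t data_ready;    /* all 3 threads: results for step i are in host memory */
pthread_barrier_t data_written;  /* all 3 threads: step i has been written to disk */

void *gpu_thread(void *arg)              /* threads 1 and 2, one device each */
{
    int dev = (int)(long)arg;
    for (int i = 0; i < NSTEPS; i++) {
        launch_kernel(dev, i);                   /* start the kernel for step i */
        if (i > 0)
            pthread_barrier_wait(&data_written); /* host buffers from step i-1 are free again */
        copy_results_to_host(dev, i);            /* waits for the kernel, then copies device->host */
        pthread_barrier_wait(&data_ready);       /* step i is now available to thread 3 */
    }
    pthread_barrier_wait(&data_written);         /* let thread 3 finish the last step */
    return NULL;
}

void *writer_thread(void *arg)           /* thread 3 */
{
    for (int i = 0; i < NSTEPS; i++) {
        pthread_barrier_wait(&data_ready);       /* both devices have delivered step i */
        write_data_to_file(i);                   /* overlaps with the kernels for step i+1 */
        pthread_barrier_wait(&data_written);     /* the host buffers may be reused */
    }
    return NULL;
}[/codebox]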

Yeah, that’s what I was thinking of doing.

IMO there is no need for a third thread here.

Just keep the two threads (each thread manages its own device). The code would then look like this:

[codebox]void threadfn()
{
    int i = init_i;
    while( more_work_to_do ) {
        launch_kernel( i );                       // start the kernel for step i
        if( i > init_i ) report_progress( i-1 );  // write out step i-1 while kernel i runs
        wait_for_completion();                    // i.e. cudaThreadSynchronize()
        i++;
    }
    report_progress( i - 1 );                     // report the final step
}[/codebox]

report_progress() should do all the progress reporting you need. You will probably have to add a critical section to it so that the function is not executed by two threads simultaneously.
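For example, something along these lines (just a sketch; what you write per step is up to you):

[codebox]pthread_mutex_t report_lock = PTHREAD_MUTEX_INITIALIZER;

void report_progress( int step )
{
    pthread_mutex_lock( &report_lock );    /* only one thread touches the output file at a time */
    append_results_to_file( step );        /* placeholder for whatever per-step output you do */
    pthread_mutex_unlock( &report_lock );
}[/codebox]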

I implemented a three-pthread solution, with a pthread_barrier_t at the start of the time loop for each thread: threads 0 and 1 execute the device code on separate GPUs and copy the results from timestep T back to host memory, while thread 2 writes the timestep T data in host memory to a ParaView file as the GPUs execute timestep T+1. Times: 221 seconds before, 151 seconds after.
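The structure I ended up with is roughly the following (a sketch using two alternating host buffers so the writer and the device-to-host copies never touch the same buffer; NSTEPS, run_timestep and write_paraview_file stand in for my real code):

[codebox]#include <pthread.h>

pthread_barrier_t step_barrier;          /* initialised for 3 threads before they are created */
float *host_buf[2];                      /* results for step t go into host_buf[t % 2] */

void *worker(void *arg)
{
    int tid = (int)(long)arg;            /* 0 and 1 drive the GPUs, 2 writes the files */

    for (int t = 0; t < NSTEPS; t++) {
        pthread_barrier_wait(&step_barrier);    /* barrier at the start of the time loop */

        if (tid < 2) {
            run_timestep(tid, t, host_buf[t % 2]);             /* kernel for step t + copy to host */
        } else if (t > 0) {
            write_paraview_file(host_buf[(t - 1) % 2], t - 1); /* overlaps with the GPUs' step t */
        }
    }

    pthread_barrier_wait(&step_barrier);        /* make sure the last copies have finished */
    if (tid == 2)
        write_paraview_file(host_buf[(NSTEPS - 1) % 2], NSTEPS - 1);
    return NULL;
}[/codebox]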

The moral of the story is to know and love pthreads.