How do I get host to write output while kernels execute on device?

I am using pthreads to execute the same kernel on two C870s, and have set aside one thread to write the results to file every timestep. I am using semaphores as a barrier, blocking the threads from proceeding until the results have been written to file on the host, but I think there is a more elegant solution to this.

At the moment the code looks like this:

[codebox]//each thread performs calculations for timestep i

//each thread cuMemcpys results in each device memory to host memory

//semaphores to block each process until all results gathered

//thread 0 writes result to file


//next time step i+1 calculations[/codebox]

Is there a way I can have the host printing results from time step i to file at the same time that GPUs are executing the kernels for timestep i+1, rather than waiting for the host to write the results from timestep i before executing the kernels for timestep i+1?

Just a guess, but what about a third thread that writes the data from host memory to file, and blocks the two executing threads from copying the data of the next step back to the host until the write has finished?

A run of this would look something like:

[codebox]Thread 1          Thread 2          Thread 3

launch kernel     launch kernel
for i             for i

wait for          wait for
data written      data written

copy data         copy data
back to host      back to host

sync threads      sync threads      sync threads
(with thread 3)   (with thread 3)

launch kernel     launch kernel     write data
for i+1           for i+1           from memory
                                    to file
wait for          wait for
data written      data written
                                    signal data
                                    written to disk

copy data         copy data
back to host      back to host[/codebox]



Yeah, that’s what I was thinking of doing.

IMO there is no need for a third thread here.

Just keep the two threads (each thread manages its own device). The code then has to be as follows:

[codebox]void threadfn()
{
	int i = init_i;

	while( more_work_to_do ) {
		launch_kernel( i );
		if( i > init_i ) report_progress( i-1 ); // report step i-1 while the kernel for step i runs
		wait_for_completion(); // i.e. cudaThreadSynchronize()
		++i;
	}

	report_progress( i - 1 ); // report the final step
}[/codebox]

report_progress() should do all the progress reporting as needed. You'll probably need to add a critical section to it, to avoid having the function called from both threads simultaneously.

I implemented a three-pthread solution with a pthread_barrier_t at the start of the timeloop for each thread: threads 0 and 1 execute device code on separate GPUs and copy the results from timestep T back to host memory, and thread 2 writes the timestep-T data from host memory into a ParaView file while the GPUs execute timestep T+1. Times? Before = 221 seconds. After = 151 seconds.

The moral of the story is to know and love pthreads.