I am using pthreads to execute the same kernel on two C870s, with one thread per GPU, and I have allocated one of those threads to write the results to file every timestep. I am using semaphores as a barrier, blocking both threads from proceeding until the results have been written to file on the host, but I suspect there is a more elegant solution.
At the moment the code looks like this:
[codebox]// each thread performs the calculations for timestep i
// each thread cuMemcpy()s its results from device memory to host memory
// semaphores block each thread until all results are gathered
// thread 0 writes the results to file
// semaphores release the threads
// each thread proceeds to the timestep i+1 calculations[/codebox]
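A minimal host-only sketch of this pattern (no CUDA calls; `compute_timestep()` stands in for the kernel launch and `cuMemcpy`, and every name here is a placeholder, not the actual code):

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NTHREADS 2
#define NSTEPS   3

static sem_t results_ready;          /* posted by each thread after its copy */
static sem_t write_done[NTHREADS];   /* posted by thread 0 after the file write */
static int results[NTHREADS];
static int writes_completed;

static void compute_timestep(int id, int step) {
    results[id] = id * 100 + step;   /* stand-in for kernel launch + cuMemcpy */
}

static void *worker(void *arg) {
    int id = (int)(long)arg;
    for (int step = 0; step < NSTEPS; step++) {
        compute_timestep(id, step);
        sem_post(&results_ready);              /* "my results are on the host" */
        if (id == 0) {
            for (int t = 0; t < NTHREADS; t++)
                sem_wait(&results_ready);      /* wait until all results arrived */
            printf("step %d: %d %d\n", step, results[0], results[1]);
            writes_completed++;
            for (int t = 0; t < NTHREADS; t++)
                sem_post(&write_done[t]);      /* release both threads */
        }
        sem_wait(&write_done[id]);             /* block until the write finished */
    }
    return NULL;
}

int run_barrier_demo(void) {
    pthread_t tid[NTHREADS];
    sem_init(&results_ready, 0, 0);
    for (int t = 0; t < NTHREADS; t++) sem_init(&write_done[t], 0, 0);
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)(long)t);
    for (int t = 0; t < NTHREADS; t++) pthread_join(tid[t], NULL);
    return writes_completed;
}
```

Note that in this shape the GPUs sit idle during the file write, which is exactly the problem.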
Is there a way to have the host write the results from timestep i to file while the GPUs are executing the kernels for timestep i+1, rather than making the kernels for timestep i+1 wait until the host has finished writing the results from timestep i?
Just a guess, but what about a third thread that writes the data from host memory to file and blocks the two executing threads from copying the next step's data back to the host until it is done?
A run of this would look something like:
[codebox]Thread1           Thread2           Thread3
launch kernel     launch kernel
  for i             for i
wait for          wait for
  data written      data written
copy data         copy data
  back to host      back to host
sync threads      sync threads
launch thread 3
launch kernel     launch kernel     write data
  for i+1           for i+1           from memory
                                      to file
wait for          wait for
  data written      data written
                                    signal data
                                      written to disk
copy data         copy data
  back to host      back to host
...[/codebox]
Just keep the 2 threads (each thread manages its own device). The code then has to be as follows:
[codebox]void threadfn()
{
    int i = init_i;
    while( more_work_to_do ) {
        launch_kernel( i );                       // async: returns immediately
        if( i > init_i ) report_progress( i-1 );  // write step i-1 while step i runs
        wait_for_completion();                    // i.e. cudaThreadSynchronize()
        i++;
    }
    report_progress( i - 1 );                     // write the final step
}[/codebox]
report_progress() should do all the progress reporting needed. You'll probably need to add a critical section to it to avoid the function being executed by two threads simultaneously.
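Such a critical section could be as simple as a mutex around the file write; here is a sketch in which the signature, the `device` parameter, and the output format are all hypothetical:

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t report_lock = PTHREAD_MUTEX_INITIALIZER;
static int lines_written;

/* Hypothetical report_progress(): writes one line of results for one
   device. The mutex keeps the two threads' output from interleaving. */
void report_progress(int step, int device, const double *data, int n) {
    pthread_mutex_lock(&report_lock);    /* enter critical section */
    printf("step %d, device %d:", step, device);
    for (int k = 0; k < n; k++) printf(" %g", data[k]);
    putchar('\n');
    lines_written++;
    pthread_mutex_unlock(&report_lock);  /* leave critical section */
}
```

If each thread writes to its own file instead of a shared one, the mutex can be dropped entirely.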
I implemented a three-pthread solution, with a pthread_barrier_t at the start of the time loop for each thread: threads 0 and 1 execute the device code on separate GPUs and copy the results from timestep T to host memory, while thread 2 writes the timestep T data already in host memory into a ParaView file as the GPUs execute timestep T+1. Times? Before = 221 seconds. After = 151 seconds.
The moral of the story is to know and love pthreads.