Overlapping kernels with hard disk writes - many possibilities...

Hi all. I have an MPI+CUDA multi-GPU code that I’m developing. It’s essentially one big time loop containing some kernels. After a given number of timesteps, I copy some data from device to host, and then I write that data from host to hard disk. I want to overlap the writing of the data with subsequent kernel executions. I can think of three ways of doing this, and I’d like some comments on which way is best, which way(s) are impossible, and of course any methods I haven’t thought of. Here are my ways:

  1. Double the number of MPI processes; for N cards, do if( mpi_tidx < N ) { kernels, memcpys } else { write }. Of course, this will contain the appropriate MPI blocking and CUDA synchronization calls where necessary. In this case, one thing I would really like is for the end user to only have to worry about the number of cores they want for computation – is it possible to take, say, -np 8 and generate 16 processes?

  2. Use streams. I don’t know much about streams. Can they even be used on the host like this?

  3. Use pthreads. Is it possible to have a combined MPI-CUDA-POSIX code?

Clearly, my first idea is the best developed. If 2 or 3 were possible, would there be any advantages?

Thanks.

Another possibility is to use OpenMP

I tried my OpenMP idea, and it didn’t work, but that may be my poor OpenMP programming skills. Here’s what I did:

omp_set_num_threads(2);

#pragma omp parallel private(time, frame)
{
  if (omp_get_thread_num() == 0)
  {
    for (time = 0; time < final; time++)
    {
      kernel<<<...>>>();

      if (!(time % N))
      {
        #pragma omp barrier
        cudaMemcpy(array_h, array_d, size, cudaMemcpyDeviceToHost);
        #pragma omp barrier
      }
    }
  }
  else
  {
    for (frame = 0; frame < N; frame++)
    {
      #pragma omp barrier
      #pragma omp barrier
      write_to_file(array_h);
    }
  }
}

I get the compile warning “warning: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region”. I ignored it and the code ran, but it took 50% longer and didn’t produce all of the frames.