Hi all. I have an MPI+CUDA multi-GPU code that I'm developing. It's essentially one big time loop containing some kernels. After a given number of timesteps, I copy some data from device to host, and then I write that data from host to hard disk. I want to overlap the writing of the data with subsequent kernel executions. I can think of three ways of doing this, and I'd like some comments as to which way is best, which way(s) are impossible, and of course any methods I haven't thought of. Here are my ways:
- Double the number of MPI processes (ranks); for N cards, do if (mpi_tidx < N) { kernels, memcpys } else { write }. Of course, this would contain the appropriate MPI blocking and CUDA synchronization calls where necessary. In this case, one thing I would really like is for the end user to only worry about the number of cores they want for computation. Is it possible to take, say, -np 8 and end up with 16 processes?
- Use CUDA streams. I don't know much about streams. Can they even be used to run host-side work like this?
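For the second option, my rough understanding is that it would look something like this (untested sketch; `d_data`, `nbytes`, `output_interval`, and the grid/block names are placeholders, and the device buffer would really need double-buffering so the next kernel doesn't overwrite it while the copy is in flight):

```cuda
// One stream for compute, one for the device-to-host copy, so the
// copy can overlap later kernel launches.
cudaStream_t compute, copy;
cudaStreamCreate(&compute);
cudaStreamCreate(&copy);

float *h_buf;                    // pinned host memory is required
cudaMallocHost(&h_buf, nbytes);  // for the copy to be truly async

for (int step = 0; step < nsteps; step++) {
    kernel<<<grid, block, 0, compute>>>(d_data);
    if (step % output_interval == 0) {
        cudaStreamSynchronize(compute);       // snapshot is consistent
        cudaMemcpyAsync(h_buf, d_data, nbytes,
                        cudaMemcpyDeviceToHost, copy);
        // Later kernels in `compute` can run while the copy proceeds.
        // But the fwrite() itself is still host code: a stream cannot
        // execute it, which is exactly my question about this option.
    }
}
cudaStreamSynchronize(copy);     // h_buf is valid here; write to disk
```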
- Use pthreads. Is it possible to combine MPI, CUDA, and POSIX threads in one code?
Clearly, my first idea is the one I've thought through the most. If the second or third approach is possible, would it have any advantages?
Thanks.