I have a kernel that integrates some variables and updates others, and all of them need to be transferred from device to host as soon as possible.
To transfer them asynchronously, can I simply call cudaMemcpyAsync() with the default stream ID of zero, or do I need to create a unique stream for each variable and have each cudaMemcpyAsync() call refer to its own stream ID?
That depends on what is supposed to run concurrently. If you want the copy to overlap with host (CPU) code, cudaMemcpyAsync() in the default stream is enough, provided the host buffer is page-locked (e.g. allocated with cudaMallocHost()); otherwise the call falls back to a synchronous copy. But if you want the copy to execute in parallel with a kernel on the GPU, you need to place them in different, non-default streams.
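A minimal sketch of both cases (buffer names and the commented-out kernel are placeholders, not from your code):

```
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 1 << 20;
    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, n * sizeof(float)); // pinned host buffer
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // Case 1: overlap copy with CPU work -- the default stream (0) is enough.
    cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                    cudaMemcpyDeviceToHost, 0);
    // ... CPU work here runs while the copy is in flight ...
    cudaStreamSynchronize(0); // wait before reading h_data

    // Case 2: overlap copy with a kernel -- both must go into
    // different, non-default streams.
    cudaStream_t copyStream, kernelStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&kernelStream);
    // someKernel<<<grid, block, 0, kernelStream>>>(d_data, n); // hypothetical
    cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                    cudaMemcpyDeviceToHost, copyStream);
    cudaDeviceSynchronize();

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(kernelStream);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```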
The situation is this: several arrays of size M are divided over N GPUs. The same kernel executes on each array of size M/N; then five variables, in arrays of size M/N, are copied D2H; then five MPI_Allgather calls are made so that each process has the same copy of the five variables in arrays of size M; then those five arrays of size M are copied H2D.
The CUDA SDK simpleStreams project uses streams to divide the data processed by the kernel, and I can see why that could be faster. But how could the second memcpy, the H2D one, be done asynchronously?
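For reference, here is a rough sketch of the current pipeline per MPI rank, for one of the five variables (integrateKernel and all buffer names are placeholders):

```
#include <cuda_runtime.h>
#include <mpi.h>

void step(float *d_local, float *h_local,   // chunks of size M/N
          float *h_global, float *d_global, // arrays of size M
          size_t chunk, int nRanks)
{
    // 1. Kernel on this rank's chunk of size M/N.
    // integrateKernel<<<grid, block>>>(d_local, chunk);

    // 2. D2H copy of the updated variable (x5 in the real code).
    cudaMemcpy(h_local, d_local, chunk * sizeof(float),
               cudaMemcpyDeviceToHost);

    // 3. Allgather so every rank holds the full size-M array (x5).
    MPI_Allgather(h_local, chunk, MPI_FLOAT,
                  h_global, chunk, MPI_FLOAT, MPI_COMM_WORLD);

    // 4. H2D copy of the assembled size-M array (x5).
    cudaMemcpy(d_global, h_global, chunk * nRanks * sizeof(float),
               cudaMemcpyHostToDevice);
}
```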
If your data is that easily partitioned into chunks that can be executed on different GPUs, you may as well partition it into more chunks and run them in different streams. That way you can (at least partially) overlap kernel execution in one stream with H<->D copies in other streams.
E.g., partition into N·K arrays of size M/(N·K), and run K streams on each of the N GPUs.
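A minimal sketch of that scheme on one GPU, assuming the host buffers are pinned and K divides the local chunk evenly (K, integrateKernel, and the buffer names are placeholders):

```
#include <cuda_runtime.h>

#define K 4 // streams per GPU (assumption)

void runChunked(float *d_data, float *h_data, size_t localN)
{
    cudaStream_t streams[K];
    const size_t piece = localN / K;

    for (int i = 0; i < K; ++i)
        cudaStreamCreate(&streams[i]);

    // The D2H copy of piece i can overlap the kernel for piece i+1,
    // because each piece lives in its own stream.
    for (int i = 0; i < K; ++i) {
        size_t off = i * piece;
        // integrateKernel<<<grid, block, 0, streams[i]>>>(d_data + off, piece);
        cudaMemcpyAsync(h_data + off, d_data + off, piece * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize(); // all pieces are now on the host

    // After the MPI_Allgather, the H2D copy of the assembled array can be
    // chunked over the same streams in the same way, overlapping with the
    // next kernel launches.

    for (int i = 0; i < K; ++i)
        cudaStreamDestroy(streams[i]);
}
```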