CUDA streams questions


I have two questions on CUDA streams.

  1. The book “CUDA by Example” by Sanders and Kandrot states that when
    using CUDA streams we should always use pinned host memory to send
    data to the device and to receive data from the GPU, allocating the
    host memory with cudaHostAlloc and copying the data with
    cudaMemcpyAsync. However, in various examples (such as the “batchCUBLAS”
    example in the CUDA 7 samples provided by NVIDIA) no pinned memory
    is used (malloc is used to allocate the host memory). So which is
    the correct way of allocating host memory when using streams?

  2. The same book indicates that the stream ID is only specified when
    copying data or executing a kernel. In other words, when allocating
    device memory we do not associate that memory with the stream in
    which it will be used. For example we could have the following sequence of commands:


Notice that the stream id (stream0) is associated with the device memory
only when invoking cudaMemcpyAsync. What is the implication of this in practice?
Does it mean that in a subsequent call I could use the same device memory pointed
to by dev_a with a different stream (say stream1) by invoking
cudaMemcpyAsync(dev_a,host_a,N*sizeof(int),cudaMemcpyHostToDevice,stream1) for
example? Or is it the case that once an allocated device memory is used with a particular
stream we cannot use it with a different stream?

So far I have not been able to find a clear answer to these questions.
Any help would be appreciated.


Pinned memory is required to overlap copy with compute. Using the cudaMemcpyAsync API with a stream is not sufficient on its own: if the host buffer is pageable (for example, allocated with malloc), the copy still completes correctly, but any attempt to overlap it with compute will fail (i.e. it will serialize).
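As a minimal sketch of the pinned-memory pattern described above (the kernel addOne, the buffer size, and the launch configuration are made up for illustration and are not from the book or the samples), something like the following could be used. Note that with a single stream nothing actually overlaps; the point is only to show the allocation and copy calls that make overlap possible once work is spread over several streams.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only for illustration: adds 1 to each element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int N = 1 << 20;                 // assumed problem size
    const size_t bytes = N * sizeof(int);

    int *host_a = nullptr;
    int *dev_a = nullptr;
    cudaStream_t stream0;

    // Pinned (page-locked) host allocation. A buffer from malloc would also
    // work functionally, but the async copies could then not overlap with
    // kernel execution in other streams.
    cudaHostAlloc((void **)&host_a, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&dev_a, bytes);
    cudaStreamCreate(&stream0);

    for (int i = 0; i < N; ++i) host_a[i] = i;

    // Copy in, compute, copy out: all issued to the same stream, so they
    // execute in issue order relative to each other.
    cudaMemcpyAsync(dev_a, host_a, bytes, cudaMemcpyHostToDevice, stream0);
    addOne<<<(N + 255) / 256, 256, 0, stream0>>>(dev_a, N);
    cudaMemcpyAsync(host_a, dev_a, bytes, cudaMemcpyDeviceToHost, stream0);

    // The host must wait for the stream before reading the results.
    cudaStreamSynchronize(stream0);
    printf("host_a[0] = %d, host_a[N-1] = %d\n", host_a[0], host_a[N - 1]);

    cudaStreamDestroy(stream0);
    cudaFree(dev_a);
    cudaFreeHost(host_a);                  // pinned memory is freed with cudaFreeHost
    return 0;
}
```

Error checking is omitted to keep the sketch short; in real code each runtime call's return value should be checked.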

Yes, you can use a given device memory allocation with any stream that is associated with that device.
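To illustrate, here is a rough sketch in which the same allocation dev_a is written through stream0 and then read through stream1. The second host buffer host_b and the event copyDone are names invented for this example. One thing to keep in mind is that operations in different streams are not ordered with respect to each other, so when the second stream depends on what the first one wrote, some explicit synchronization (an event here) is needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 1024;                    // assumed size
    const size_t bytes = N * sizeof(int);

    int *host_a = nullptr, *host_b = nullptr, *dev_a = nullptr;
    cudaStream_t stream0, stream1;
    cudaEvent_t copyDone;                  // hypothetical name

    cudaHostAlloc((void **)&host_a, bytes, cudaHostAllocDefault);
    cudaHostAlloc((void **)&host_b, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&dev_a, bytes);    // not tied to any stream
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);
    cudaEventCreate(&copyDone);

    for (int i = 0; i < N; ++i) { host_a[i] = i; host_b[i] = -1; }

    // First use of dev_a: host-to-device copy in stream0.
    cudaMemcpyAsync(dev_a, host_a, bytes, cudaMemcpyHostToDevice, stream0);
    cudaEventRecord(copyDone, stream0);

    // Second use of the same dev_a, now in stream1. The wait makes stream1
    // hold off until the copy issued to stream0 has finished.
    cudaStreamWaitEvent(stream1, copyDone, 0);
    cudaMemcpyAsync(host_b, dev_a, bytes, cudaMemcpyDeviceToHost, stream1);

    cudaStreamSynchronize(stream1);

    // Check that the data made the round trip through both streams.
    int mismatches = 0;
    for (int i = 0; i < N; ++i) if (host_b[i] != host_a[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaEventDestroy(copyDone);
    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
    cudaFree(dev_a);
    cudaFreeHost(host_a);
    cudaFreeHost(host_b);
    return 0;
}
```

Without the event (or some other synchronization such as cudaStreamSynchronize(stream0)), the copy in stream1 could start before the copy in stream0 has completed.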

The relevant section of the programming guide gives additional details. Streams are associated with a particular device (which device? The one that was current, i.e. last selected by cudaSetDevice(), when the stream was created); this is indicated in the programming guide section on multi-device activity.
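A small sketch of what this means in practice, assuming a system with one or more GPUs (the buffer size and the use of cudaMemsetAsync as the per-device work are arbitrary choices for illustration): each stream is created while its device is current, and is then used together with memory allocated on that same device.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    const size_t bytes = 1024 * sizeof(int);   // assumed buffer size
    std::vector<cudaStream_t> streams(deviceCount);
    std::vector<int *> buffers(deviceCount, nullptr);

    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);                      // device d is now current
        cudaStreamCreate(&streams[d]);         // so this stream belongs to device d
        cudaMalloc((void **)&buffers[d], bytes);   // allocated on device d

        // Work issued to the stream of the device that owns the buffer.
        cudaMemsetAsync(buffers[d], 0, bytes, streams[d]);
    }

    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);
        cudaStreamDestroy(streams[d]);
        cudaFree(buffers[d]);
    }

    printf("used %d device(s), one stream each\n", deviceCount);
    return 0;
}
```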