I have two questions on CUDA streams.
1. The book “CUDA by Example” by Sanders and Kandrot states
that when using CUDA streams we should always use pinned host memory
for transfers to and from the device, allocating the host memory with
cudaHostAlloc and copying the data with cudaMemcpyAsync.
However, various examples (such as the “batchCUBLAS”
sample in the CUDA 7 samples provided by NVIDIA) use no pinned memory;
they allocate the host memory with malloc. So which is
the correct way of allocating host memory when using streams?
2. The same book indicates that a stream ID is supplied only
when copying data or launching a kernel. In other words, when allocating
device memory we do not associate it with the stream in
which it will be used. For example, we could allocate dev_a first and only later
pass stream0 to cudaMemcpyAsync when copying into it, so the stream ID (stream0)
becomes associated with the device memory only at the cudaMemcpyAsync call.
What is the implication of this in practice?
Does it mean that a subsequent call could use the same device memory pointed
to by dev_a with a different stream (say stream1), by passing stream1 instead
of stream0, for example? Or is it the case that once allocated device memory has
been used with a particular stream, it cannot be used with a different stream?
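To make the second question concrete, here is a minimal sketch of the scenario I have in mind. It is not the book's code: `host_a`, `N`, the `CHECK` macro and `run_stream_demo` are placeholder names of my own, and the `cudaStreamSynchronize` between the two copies reflects my assumption that ordering between different streams would be my responsibility. The host buffer is pinned with `cudaHostAlloc`, as the book recommends.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

#define N (1024 * 1024)  /* hypothetical element count */

/* Report a failing CUDA call and bail out (no cleanup on the error path). */
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t e_ = (call);                                  \
        if (e_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(e_));                      \
            return 1;                                             \
        }                                                         \
    } while (0)

/* Returns 0 if the data survives a round trip that uses two streams. */
int run_stream_demo(void) {
    int *host_a = NULL;
    int *dev_a = NULL;
    cudaStream_t stream0, stream1;
    size_t bytes = N * sizeof(int);
    int i, ok = 1;

    CHECK(cudaStreamCreate(&stream0));
    CHECK(cudaStreamCreate(&stream1));

    /* Pinned (page-locked) host memory; no stream is named here. */
    CHECK(cudaHostAlloc((void **)&host_a, bytes, cudaHostAllocDefault));
    /* Device allocation; no stream is named here either. */
    CHECK(cudaMalloc((void **)&dev_a, bytes));

    for (i = 0; i < N; ++i) host_a[i] = i;

    /* First use of dev_a: host-to-device copy issued on stream0. */
    CHECK(cudaMemcpyAsync(dev_a, host_a, bytes,
                          cudaMemcpyHostToDevice, stream0));
    /* Assumption: wait for stream0 before touching dev_a from stream1. */
    CHECK(cudaStreamSynchronize(stream0));

    /* Clear the host buffer so the copy back is observable. */
    memset(host_a, 0, bytes);

    /* Second use of the same dev_a, this time on stream1. */
    CHECK(cudaMemcpyAsync(host_a, dev_a, bytes,
                          cudaMemcpyDeviceToHost, stream1));
    CHECK(cudaStreamSynchronize(stream1));

    for (i = 0; i < N; ++i) {
        if (host_a[i] != i) { ok = 0; break; }
    }

    CHECK(cudaFreeHost(host_a));
    CHECK(cudaFree(dev_a));
    CHECK(cudaStreamDestroy(stream0));
    CHECK(cudaStreamDestroy(stream1));
    return ok ? 0 : 1;
}
```

Calling `run_stream_demo()` from `main` and checking its return value would show whether the second copy, issued on stream1, sees the data written through stream0.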
So far I have not been able to find a clear answer to these questions.
Any help would be appreciated.