CUDA streams questions

Hello

I have two questions on CUDA streams.

  1. In the “CUDA by example” book by Sanders and Kandrot it is indicated
    that when using CUDA streams we should always use pinned host memory
    to send data to the device and to receive data from the GPU, by using
    cudaHostAlloc to allocate the host memory and cudaMemcpyAsync to
    copy the data. However in various examples (such as the “batchCUBLAS”
    example in the CUDA-7 samples provided by NVIDIA) no pinned memory
    is used (malloc is used to allocate the host memory). So which is
    the correct way of allocating host memory when using streams?

  2. In the same book it is indicated that a stream ID is associated
    only when copying data or launching a kernel. In other words, when
    allocating device memory we do not associate that memory with the
    stream in which it will be used. For example, we could have the
    following sequence of commands:

cudaHostAlloc((void**)&host_a, N*sizeof(int), cudaHostAllocDefault);
cudaMalloc((void**)&dev_a, N*sizeof(int));
cudaMemcpyAsync(dev_a, host_a, N*sizeof(int), cudaMemcpyHostToDevice, stream0);

Notice that the stream id (stream0) is associated with the device memory
only when invoking cudaMemcpyAsync. What is the implication of this in practice?
Does it mean that in a subsequent call I could use the same device memory pointed
to by dev_a with a different stream (say stream1), for example by invoking
cudaMemcpyAsync(dev_a, host_a, N*sizeof(int), cudaMemcpyHostToDevice, stream1)?
Or is it the case that once a device allocation has been used with a particular
stream, we cannot use it with a different stream?

So far I have not been able to find a clear answer to these questions.
Any help would be appreciated.

Thanks

Pinned memory is required to overlap copy with compute; merely using the cudaMemcpyAsync API with a stream is not sufficient. Without pinned host memory, any attempt to overlap copy with compute will fail (i.e. it will serialize). Samples such as batchCUBLAS that do not depend on copy/compute overlap can therefore allocate host memory with plain malloc.
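A minimal sketch of the overlap pattern (error checking omitted; `scale` is a placeholder kernel and N is an arbitrary size, not from the original post): the data is split into two chunks issued in two streams, and because the host buffer is pinned, the copy in one stream can overlap with the kernel in the other.

```cuda
#include <cuda_runtime.h>

__global__ void scale(int *d, int n)   // placeholder kernel, just for illustration
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2;
}

int main(void)
{
    const int N = 1 << 20;             // total elements (arbitrary size)
    const int CHUNK = N / 2;
    int *host_a, *dev_a0, *dev_a1;
    cudaStream_t stream0, stream1;

    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);

    // Pinned (page-locked) host memory: without it, cudaMemcpyAsync
    // cannot overlap the transfer with compute.
    cudaHostAlloc((void**)&host_a, N * sizeof(int), cudaHostAllocDefault);
    cudaMalloc((void**)&dev_a0, CHUNK * sizeof(int));
    cudaMalloc((void**)&dev_a1, CHUNK * sizeof(int));

    // Issue each half of the work in its own stream; the copy in one
    // stream can then overlap with the kernel in the other.
    cudaMemcpyAsync(dev_a0, host_a, CHUNK * sizeof(int),
                    cudaMemcpyHostToDevice, stream0);
    scale<<<(CHUNK + 255) / 256, 256, 0, stream0>>>(dev_a0, CHUNK);
    cudaMemcpyAsync(host_a, dev_a0, CHUNK * sizeof(int),
                    cudaMemcpyDeviceToHost, stream0);

    cudaMemcpyAsync(dev_a1, host_a + CHUNK, CHUNK * sizeof(int),
                    cudaMemcpyHostToDevice, stream1);
    scale<<<(CHUNK + 255) / 256, 256, 0, stream1>>>(dev_a1, CHUNK);
    cudaMemcpyAsync(host_a + CHUNK, dev_a1, CHUNK * sizeof(int),
                    cudaMemcpyDeviceToHost, stream1);

    cudaDeviceSynchronize();           // wait for both streams to finish

    cudaFreeHost(host_a);
    cudaFree(dev_a0);
    cudaFree(dev_a1);
    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
    return 0;
}
```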

Yes, you can use a given device memory allocation with any stream that is associated with that device.
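For instance, the following sketch (error checking omitted; stream creation added, since the original snippet assumes stream0 and stream1 already exist) uses the very same dev_a allocation with two different streams. Note the synchronization between the two uses: the allocation itself is stream-agnostic, but you are responsible for ordering accesses so the two streams do not race on the same memory.

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1 << 20;             // arbitrary size for illustration
    int *host_a, *dev_a;
    cudaStream_t stream0, stream1;

    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);
    cudaHostAlloc((void**)&host_a, N * sizeof(int), cudaHostAllocDefault);
    cudaMalloc((void**)&dev_a, N * sizeof(int));

    // Use dev_a in stream0 ...
    cudaMemcpyAsync(dev_a, host_a, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream0);
    cudaStreamSynchronize(stream0);    // ensure stream0 is done with dev_a

    // ... and then the very same allocation in stream1.
    cudaMemcpyAsync(dev_a, host_a, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream1);
    cudaStreamSynchronize(stream1);

    cudaFreeHost(host_a);
    cudaFree(dev_a);
    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
    return 0;
}
```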

The relevant section of the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-data-transfers

gives additional details. The fact that streams are associated with a particular device (namely, the device that was current, i.e. most recently selected with cudaSetDevice(), when the stream was created) is indicated in the programming guide section on multi-device behavior:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-and-event-behavior