Question about streams

Good afternoon everyone,

I have a question about CUDA streams.

Suppose I have the following code snippet:


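cudaStream_t s;
cudaStreamCreate(&s);  // the stream must be created before it is used

some_kernel<<<grid, block, 0, s>>>();  // kernel launched in the non-default stream s

cudaMemcpyAsync(dst, src, memSize, cudaMemcpyHostToDevice, 0);  // copy issued to the default stream (0)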


Basically, I launch the kernel in a stream other than the default stream, and I issue the cudaMemcpyAsync() to the default stream, 0.

Does this mean that the cudaMemcpyAsync() cannot complete (i.e., the data will not have been fully copied to the device) until the kernel has finished, because the copy is issued to the default stream?



I found the following in the CUDA Programming Guide 2.3:

Two commands from different streams cannot run concurrently if either a page-locked
host memory allocation, a device memory allocation, a device memory set, a
device ↔ device memory copy, or any CUDA command to stream 0 is called in between
them by the host thread.

I think this may answer the question I raised above. Can anyone from NVIDIA confirm this? Tim? ^_^
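
In case it helps, here is a minimal sketch of how the behaviour could be checked empirically. It is only a rough test, not a definitive one. The names spin_kernel and run_stream_demo are hypothetical ones made up for this example, the spin length is arbitrary, and it assumes a device that supports clock64() and a build that uses the ordinary (legacy) default stream, i.e. not compiled with --default-stream per-thread. The idea: start a long-running kernel in stream s, issue the copy to stream 0, wait for stream 0, and then ask whether stream s is already idle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-wait kernel: spins for roughly `cycles` clock ticks
// so that it is still running when the copy is issued.
__global__ void spin_kernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Returns 0 if the kernel in stream s had already finished when the
// stream-0 copy finished, 1 if it was still running, -1 on a CUDA error.
int run_stream_demo() {
    const size_t memSize = 1 << 20;
    float *src = NULL, *dst = NULL;
    cudaStream_t s;

    // Page-locked host memory, so the copy can be truly asynchronous.
    if (cudaMallocHost((void**)&src, memSize) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&dst, memSize) != cudaSuccess) return -1;
    if (cudaStreamCreate(&s) != cudaSuccess) return -1;

    // Kernel in the non-default stream, copy in the default stream.
    spin_kernel<<<1, 1, 0, s>>>(200000000LL);  // arbitrary spin length
    cudaMemcpyAsync(dst, src, memSize, cudaMemcpyHostToDevice, 0);

    // Wait only for the default stream, then poll stream s without blocking.
    cudaStreamSynchronize(0);
    cudaError_t kernelState = cudaStreamQuery(s);

    int result;
    if (kernelState == cudaSuccess)            result = 0;  // kernel already done
    else if (kernelState == cudaErrorNotReady) result = 1;  // kernel still running
    else                                       result = -1;

    // Make sure everything has finished before releasing resources.
    cudaDeviceSynchronize();
    cudaStreamDestroy(s);
    cudaFree(dst);
    cudaFreeHost(src);
    return result;
}

int main() {
    int r = run_stream_demo();
    if (r == 0)      printf("kernel had finished before the stream-0 copy completed\n");
    else if (r == 1) printf("kernel was still running when the stream-0 copy completed\n");
    else             printf("CUDA error\n");
    return r < 0 ? 1 : 0;
}
```

If the passage quoted above applies, I would expect the first message, since the copy in stream 0 should not be able to complete while the kernel in s is still running. Issuing the copy to a second stream created with cudaStreamCreate() instead of stream 0 would be the natural comparison case.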