Good afternoon everyone,
I have a question about CUDA streams.
Suppose I have the following code snippet:
some_kernel<<<grid, block, 0, s>>>();
cudaMemcpyAsync(dst, src, memSize, cudaMemcpyHostToDevice, 0);
Basically, I launch the kernel in a stream s other than the default stream, and I issue the cudaMemcpyAsync() in the default stream, 0.
Does this mean that the cudaMemcpyAsync() cannot complete (i.e., the data cannot have finished copying to the device) until the kernel has finished, because the copy is issued in the default stream?
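For reference, here is a minimal sketch of how I think this could be checked empirically. It is only a sketch under several assumptions: spin_kernel is a hypothetical stand-in for some_kernel that just busy-waits, the spin length and buffer size are arbitrary, the host buffer is pinned so that the copy can actually be asynchronous, and the file is compiled with the legacy default stream (i.e. without nvcc's --default-stream per-thread option). The helper kernel_done_when_copy_done is my own name, not a CUDA API. It records one event after the kernel in s and one after the copy in stream 0, waits for the copy, and then asks whether the kernel had already finished. It runs the experiment twice: once with s created as an ordinary (blocking) stream and once with cudaStreamNonBlocking, which as far as I understand the documentation opts the stream out of implicit synchronization with stream 0.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for some_kernel: busy-waits for about `cycles` clock ticks.
__global__ void spin_kernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Returns true if the kernel launched in stream s had already finished
// at the moment the copy issued in stream 0 completed.
static bool kernel_done_when_copy_done(unsigned int streamFlags) {
    const size_t memSize = 1 << 20;   // arbitrary 1 MiB buffer
    void *src = nullptr, *dst = nullptr;
    CHECK(cudaMallocHost(&src, memSize));   // pinned host memory
    CHECK(cudaMalloc(&dst, memSize));
    memset(src, 0, memSize);

    cudaStream_t s;
    CHECK(cudaStreamCreateWithFlags(&s, streamFlags));
    cudaEvent_t kernelDone, copyDone;
    CHECK(cudaEventCreate(&kernelDone));
    CHECK(cudaEventCreate(&copyDone));

    // Same pattern as in the question: kernel in s, copy in stream 0.
    spin_kernel<<<1, 1, 0, s>>>(500000000LL);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(kernelDone, s));
    CHECK(cudaMemcpyAsync(dst, src, memSize, cudaMemcpyHostToDevice, 0));
    CHECK(cudaEventRecord(copyDone, 0));

    // Wait only for the copy, then check whether the kernel is already done.
    CHECK(cudaEventSynchronize(copyDone));
    bool done = (cudaEventQuery(kernelDone) == cudaSuccess);

    // Let everything finish before releasing resources.
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaEventDestroy(kernelDone));
    CHECK(cudaEventDestroy(copyDone));
    CHECK(cudaStreamDestroy(s));
    CHECK(cudaFree(dst));
    CHECK(cudaFreeHost(src));
    return done;
}

int main() {
    printf("blocking stream s:     kernel finished before copy completed? %s\n",
           kernel_done_when_copy_done(cudaStreamDefault) ? "yes" : "no");
    printf("non-blocking stream s: kernel finished before copy completed? %s\n",
           kernel_done_when_copy_done(cudaStreamNonBlocking) ? "yes" : "no");
    return 0;
}
```

If the copy in stream 0 really has to wait for the kernel, I would expect "yes" in the first case. The second case depends on timing, so a "no" there would only suggest (not prove) that the non-blocking stream is not being waited on. Building with --default-stream per-thread would change what stream 0 means, so the result should not be generalized to that configuration.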