Confusion about CUDA streams

Hi everyone,
I have some questions about CUDA streams that came up while learning the concept.
Here they are:

First, are streams only good for overlapping memory copies with kernel execution, or can they also make different kernels run concurrently? In other words, can task parallelism be realized on the GPU with streams? (In my application, a for loop calls a GPU function that contains about 6 kernels. If the GPU's resources are not sufficient, is using streams pointless?)

Second, I am not clear about the concept of streams. I have looked at the ‘simpleStreams’ CUDA sample, which launches 4 streams. Does the following code mean that the whole data set is split into 4 pieces, one for each of the 4 streams to work on?

    for (int k = 0; k < nreps; k++)
    {
        // asynchronously launch nstreams kernels, each operating on its own portion of data
        for (int i = 0; i < nstreams; i++)
        {
            init_array<<<blocks, threads, 0, streams[i]>>>(d_a + i *n / nstreams, d_c, niterations);
        }

        // asynchronously launch nstreams memcopies.  Note that memcopy in stream x will only
        //   commence executing when all previous CUDA calls in stream x have completed
        for (int i = 0; i < nstreams; i++)
        {
            checkCudaErrors(cudaMemcpyAsync(hAligned_a + i * n / nstreams, d_a + i * n / nstreams, nbytes / nstreams, cudaMemcpyDeviceToHost, streams[i]));
        }
    }

But how should I choose the number of streams in an application? My previous understanding was that the number of streams should match the number of loop iterations: if the loop runs n times, there should be n streams (one stream per iteration). I don’t know whether this view is right, so for now I don’t know how to use streams in my application. Also, I have already transferred the data to the device before the loop, so there may be no H2D transfers inside it.

Finally, when I run the application, the GPU load shown in GPU-Z is low (about 10%). Can streams raise it?
I also use the CULA library in my GPU function (along with some kernels I wrote myself). Can I use streams in this situation? (I have seen someone on the CULA forum say that CULA functions use streams internally, and that adding streams may reduce performance.)

These are the questions that confuse me. I hope you can give me some suggestions.

By the way, if there are any references on streams, please let me know. (I can’t understand everything from the programming guide.)

Thanks in advance

[url]http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf[/url]

OK, first, thanks for your reply.
But I am still confused about using CULA with streams and about how to choose the number of streams in an application. Could you give some information about these two?

And if there are not sufficient resources (such as block capacity, etc.), do streams bring only a small performance gain from features like ‘concurrentKernels’ and ‘Hyper-Q’?

thx

Streams can still be used for overlap of copy and compute, even with only a single kernel. You’ll find examples of this in the slide deck I linked, such as slide 3. There is only 1 kernel running at any given point in time there, but the overlap of copy and compute can give a performance improvement over a naive code.

I would argue that this usage of streams is more important than the usage of streams for enabling concurrent kernels - for the reasons already mentioned (concurrent kernels are hard to witness in practice).
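Whether either kind of overlap is available at all depends on the device. As a rough sketch (the helper name `reportOverlapSupport` is made up for illustration; the `cudaDeviceProp` fields are from the CUDA runtime API), the relevant properties can be queried like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper, not from any sample: prints the device properties
// that bound what streams can overlap. Returns false on an API error.
bool reportOverlapSupport(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;

    // Non-zero if kernels issued to different streams may run concurrently.
    printf("concurrentKernels   : %d\n", prop.concurrentKernels);
    // Copy engines: 0 = no copy/compute overlap, 1 = copies in one direction
    // can overlap with kernels, 2 = copies in both directions can.
    printf("asyncEngineCount    : %d\n", prop.asyncEngineCount);
    // Number of SMs; a kernel that already fills them leaves no room
    // for a second kernel to run alongside it.
    printf("multiProcessorCount : %d\n", prop.multiProcessorCount);
    return true;
}
```

Calling `reportOverlapSupport(0)` from `main` prints the values for device 0. Even when `concurrentKernels` is non-zero, kernels only overlap if one of them leaves SM resources free.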

It’s not possible to arrange for the overlap of copy and compute without using streams.
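A minimal sketch of that pattern, under stated assumptions: the kernel `addOne` and the function `runChunked` are illustrative names, `n` is assumed to be divisible by `nStreams`, and error checking is left out for brevity. The host buffer is allocated with `cudaMallocHost` because `cudaMemcpyAsync` needs pinned host memory to actually overlap with kernel execution.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: adds 1 to each element of its chunk.
__global__ void addOne(float *data, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] += 1.0f;
}

// Splits n floats into nStreams chunks. Each stream copies its chunk to the
// device, runs the kernel on it, and copies it back. Work in different
// streams may overlap; work within one stream runs in issue order.
// Returns true if every element ends up equal to 1.0f.
bool runChunked(int n, int nStreams)
{
    const int chunk = n / nStreams;               // assumes n % nStreams == 0
    const size_t chunkBytes = chunk * sizeof(float);

    float *h = nullptr, *d = nullptr;
    cudaMallocHost((void **)&h, n * sizeof(float)); // pinned host memory
    cudaMalloc((void **)&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    cudaStream_t *streams = new cudaStream_t[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;
    for (int s = 0; s < nStreams; ++s) {
        const int offset = s * chunk;
        cudaMemcpyAsync(d + offset, h + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addOne<<<blocks, threads, 0, streams[s]>>>(d + offset, chunk);
        cudaMemcpyAsync(h + offset, d + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                      // wait for all streams

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h[i] == 1.0f);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    delete[] streams;
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

For example, `runChunked(1 << 20, 4)` processes about a million floats in four chunks. Here the number of streams is simply the number of chunks the data is split into; running it under a profiler with a few different values is one way to see how much overlap a given count actually produces.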

Maybe I did not express myself clearly.
What I want to know is: what should the number of streams in an application depend on?

And I have indeed looked at the slide deck you linked. It shows 4+ way concurrency of memory copies and kernel execution; that is hard to witness, right?

Thanks a lot anyway.
I have seen an approach that splits the raw data into several blocks, with each stream handling one block, so the whole job is done with n streams.
[url]https://github.com/parallel-forall/code-samples/blob/master/series/cuda-cpp/overlap-data-transfers/async.cu[/url]

Maybe I have missed something; anyway, I need to understand this further.
Thank you for your attention once again.