confusions about CUDA streams

hlei · July 29, 2017, 9:16am

Hi everyone,
I have some questions about CUDA streams that came up while learning the concept.
They are as follows:

First, I want to know whether streams only provide concurrency between memory copies and kernel execution, or whether they can also provide concurrency between different kernels. In other words, can multi-task parallelism be realized on the GPU with streams? (In my application, a for loop calls a GPU function that contains about 6 kernels. If the resources are not sufficient, will using streams be pointless?)

Second, I am not so clear about the concept of streams. I have seen the ‘simpleStreams’ example in the CUDA samples, which launches 4 streams. Does the following code mean that the whole data set is split into 4 pieces, one for each of the 4 streams to work on?

for (int k = 0; k < nreps; k++)
{
    // asynchronously launch nstreams kernels, each operating on its own portion of data
    for (int i = 0; i < nstreams; i++)
    {
        init_array<<<blocks, threads, 0, streams[i]>>>(d_a + i * n / nstreams, d_c, niterations);
    }

    // asynchronously launch nstreams memcopies.  Note that memcopy in stream x will only
    //   commence executing when all previous CUDA calls in stream x have completed
    for (int i = 0; i < nstreams; i++)
    {
        checkCudaErrors(cudaMemcpyAsync(hAligned_a + i * n / nstreams, d_a + i * n / nstreams, nbytes / nstreams, cudaMemcpyDeviceToHost, streams[i]));
    }
}

But how should I set the number of streams in an application? My previous view was that the number of streams should match the number of loop iterations in the application, i.e. if there are n iterations, there should be n streams (one stream per iteration). I don’t know if this view is right, so for now I just don’t know how to use streams in my application. Also, I have already transferred the data to the device before the loop, so there may be no H2D transfer inside the loop.

Finally, when I run the application, the GPU load shown in GPU-Z is low (about 10%). Can streams make it higher?
Also, I use the CULA library in my GPU function (along with some other kernels I wrote myself). Can I use streams in this situation? (I have seen someone on the CULA forum say that CULA functions use streams internally, and that using streams may reduce performance.)

These are the questions that have me confused; I hope you can give me some suggestions.

By the way, if there are any references about streams, please let me know. (I can’t fully understand them from the programming guide alone.)

Thanks in advance.

Robert_Crovella · July 29, 2017, 12:58pm

http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf

hlei · July 29, 2017, 2:41pm

OK, first, thanks for your reply.
But I am still confused about using CULA with streams and about how to set the number of streams in an application. Could you give some information on these two?

And if resources are insufficient (such as block capacity, etc.), do streams bring only a small performance gain from features like ‘concurrentKernels’ and ‘Hyper-Q’?

thx
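
Whether kernels from different streams can overlap at all depends first on what the device reports. The following is a minimal sketch, assuming device 0 is the GPU of interest, that prints the relevant fields of `cudaDeviceProp`. It only queries capabilities; it does not show that any particular application will achieve concurrency.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: inspect the first GPU in the system
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("Device: %s (compute capability %d.%d)\n",
                prop.name, prop.major, prop.minor);
    // Nonzero if the device can run kernels from different streams at the same time.
    std::printf("concurrentKernels:           %d\n", prop.concurrentKernels);
    // Number of copy engines; at least 1 is needed to overlap a copy with a kernel.
    std::printf("asyncEngineCount:            %d\n", prop.asyncEngineCount);
    // Rough indicators of how much room there is for more than one kernel.
    std::printf("multiProcessorCount:         %d\n", prop.multiProcessorCount);
    std::printf("maxThreadsPerMultiProcessor: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

Even when `concurrentKernels` is nonzero, a kernel whose blocks already fill the GPU leaves no room for a second kernel to run beside it, so kernel overlap mostly shows up with small kernels.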

Robert_Crovella · July 29, 2017, 2:58pm

Streams can still be used for overlap of copy and compute, even with only a single kernel. The slide deck I linked gives examples of this, such as slide 3. There is only 1 kernel running at any given point in time there, but the overlap of copy and compute can give a performance improvement over naive code.

I would argue that this usage of streams is more important than the usage of streams for enabling concurrent kernels - for the reasons already mentioned (concurrent kernels are hard to witness in practice).

It’s not possible to arrange for the overlap of copy and compute without using streams.
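
A minimal sketch of the copy/compute overlap described above, under stated assumptions: the kernel `addOne`, the helper `ceilDiv`, the `CHECK` macro, the array size, and the choice of 4 streams are all hypothetical and chosen only for illustration. Each stream copies its own chunk to the device, runs the kernel on it, and copies it back, so the copy for one chunk may overlap with the kernel for another if the hardware allows it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",       \
                         cudaGetErrorString(err_), __LINE__);        \
            std::exit(1);                                            \
        }                                                            \
    } while (0)

// Hypothetical kernel: adds 1 to each element of the chunk it is given.
__global__ void addOne(float *data, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] += 1.0f;
}

// Integer division rounded up, used to size the grid.
static int ceilDiv(int a, int b) { return (a + b - 1) / b; }

int main() {
    const int nStreams = 4;        // assumption: 4 chunks, as in simpleStreams
    const int n = 1 << 22;         // total elements, divisible by nStreams
    const int chunk = n / nStreams;
    const size_t chunkBytes = chunk * sizeof(float);
    const int threads = 256;

    float *h_data = nullptr, *d_data = nullptr;
    // Pinned host memory: pageable memory would prevent the copies from overlapping.
    CHECK(cudaMallocHost(&h_data, n * sizeof(float)));
    CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h_data[i] = float(i % 1000);

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Issue copy-in, kernel, copy-out for each chunk in that chunk's stream.
    for (int s = 0; s < nStreams; ++s) {
        const int offset = s * chunk;
        CHECK(cudaMemcpyAsync(d_data + offset, h_data + offset, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        addOne<<<ceilDiv(chunk, threads), threads, 0, streams[s]>>>(d_data + offset, chunk);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_data + offset, d_data + offset, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for all streams to finish

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_data[i] != float(i % 1000) + 1.0f) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d_data));
    CHECK(cudaFreeHost(h_data));
    return mismatches == 0 ? 0 : 1;
}
```

Whether the copies and kernels actually overlap is best confirmed in a profiler timeline rather than assumed from the code.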

hlei · July 29, 2017, 3:28pm

Maybe my wording was not clear.
What I want to know is what I should base the number of streams on in an application.

And indeed I have seen the slide deck you linked; it shows 4+ way concurrency of memory copy and kernel execution. This is hard to witness, right?

hlei · July 30, 2017, 3:26pm

Thanks a lot anyway.
I have seen an approach that splits the raw data into several blocks, with each stream handling one data block, so the whole job ends up being done with n streams.
https://github.com/parallel-forall/code-samples/blob/master/series/cuda-cpp/overlap-data-transfers/async.cu

Maybe I have missed something; anyway, I need to understand this further.
Thank you for your attention once again
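
On the open question of how many streams to use, one practical option is to measure rather than derive the number. The sketch below is an illustration under stated assumptions, not a rule from this thread: it times the same chunked copy-kernel-copy pipeline for several stream counts. The kernel `scaleAndShift`, the helpers `ceilDiv` and `timePipeline`, the `CHECK` macro, the array size, and the list of counts are all hypothetical.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",       \
                         cudaGetErrorString(err_), __LINE__);        \
            std::exit(1);                                            \
        }                                                            \
    } while (0)

// Hypothetical stand-in for real per-element work.
__global__ void scaleAndShift(float *data, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] = data[i] * 2.0f + 1.0f;
}

// Integer division rounded up, used to size the grid.
static int ceilDiv(int a, int b) { return (a + b - 1) / b; }

// Splits n elements into nStreams chunks, issues copy-in, kernel, and copy-out
// for each chunk in its own stream, and returns wall-clock time in milliseconds.
static double timePipeline(float *h_data, float *d_data, int n, int nStreams) {
    const int threads = 256;
    const int chunk = n / nStreams;  // assumption: n is divisible by nStreams
    const size_t chunkBytes = chunk * sizeof(float);
    std::vector<cudaStream_t> streams(nStreams);
    for (auto &s : streams) CHECK(cudaStreamCreate(&s));

    CHECK(cudaDeviceSynchronize());  // start from an idle device
    auto t0 = std::chrono::steady_clock::now();
    for (int s = 0; s < nStreams; ++s) {
        float *h = h_data + s * chunk;
        float *d = d_data + s * chunk;
        CHECK(cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[s]));
        scaleAndShift<<<ceilDiv(chunk, threads), threads, 0, streams[s]>>>(d, chunk);
        CHECK(cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for every stream to drain
    auto t1 = std::chrono::steady_clock::now();

    for (auto &s : streams) CHECK(cudaStreamDestroy(s));
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 1 << 24;  // divisible by every stream count tried below
    float *h_data = nullptr, *d_data = nullptr;
    CHECK(cudaMallocHost(&h_data, n * sizeof(float)));  // pinned, so copies can be async
    CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    timePipeline(h_data, d_data, n, 1);  // untimed warm-up run

    const int counts[] = {1, 2, 4, 8, 16};
    for (int nStreams : counts) {
        double ms = timePipeline(h_data, d_data, n, nStreams);
        std::printf("%2d streams: %.3f ms\n", nStreams, ms);
    }

    CHECK(cudaFree(d_data));
    CHECK(cudaFreeHost(h_data));
    return 0;
}
```

The timings depend on the GPU, the transfer sizes, and the kernel, so the count that works best here says little about another application. Note also that this pipeline includes host-to-device copies; if the data is already on the device before the loop, there is less copy work to overlap.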
