Hi everyone,
I have some questions about CUDA streams that came up while I was learning the concept.
My questions are as follows:
First, do streams only provide concurrency between memory copies and kernel execution, or can they also provide concurrency between different kernels? In other words, can task-level parallelism be realized on the GPU with streams? (In my application, a for loop calls a GPU function that contains about 6 kernels. If the GPU's resources are not sufficient, does using streams become pointless?)
Second, I am not clear on the concept of streams. I have looked at the ‘simpleStreams’ example in the CUDA samples, which launches 4 streams. Does the following code mean that the whole data set is split into 4 pieces, one for each of the 4 streams to process?
for (int k = 0; k < nreps; k++)
{
    // asynchronously launch nstreams kernels, each operating on its own portion of data
    for (int i = 0; i < nstreams; i++)
    {
        init_array<<<blocks, threads, 0, streams[i]>>>(d_a + i * n / nstreams, d_c, niterations);
    }

    // asynchronously launch nstreams memcopies. Note that memcopy in stream x will only
    // commence executing when all previous CUDA calls in stream x have completed
    for (int i = 0; i < nstreams; i++)
    {
        checkCudaErrors(cudaMemcpyAsync(hAligned_a + i * n / nstreams, d_a + i * n / nstreams, nbytes / nstreams, cudaMemcpyDeviceToHost, streams[i]));
    }
}
But how should I choose the number of streams in an application? My previous understanding was that the number of streams should match the number of loop iterations: if the loop runs n times, there should be n streams (one stream per iteration). I don’t know whether this view is right, so for now I don’t know how to use streams in my application. Also, I have already transferred the data to the device before the loop, so there may be no H2D transfer inside the loop.
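To make the idea above concrete, here is a minimal sketch of the "one stream per loop iteration" approach I have in mind. The kernel name step_kernel, the sizes, and the loop count are placeholders made up for illustration (this is not my real code), and I am not sure this is the right way to use streams.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the kernels in my GPU function (hypothetical).
__global__ void step_kernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] += 1.0f;
}

int main()
{
    const int nloops  = 4;        // assumed number of loop iterations
    const int n       = 1 << 20;  // assumed number of elements per iteration
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    // The data is already on the device before the loop, so no H2D copy inside it.
    float *d_data = nullptr;
    cudaMalloc(&d_data, (size_t)nloops * n * sizeof(float));
    cudaMemset(d_data, 0, (size_t)nloops * n * sizeof(float));

    // One stream per loop iteration (the assumption I am asking about).
    cudaStream_t streams[nloops];
    for (int i = 0; i < nloops; i++)
        cudaStreamCreate(&streams[i]);

    // Each iteration works on its own slice of the data in its own stream.
    for (int i = 0; i < nloops; i++)
        step_kernel<<<blocks, threads, 0, streams[i]>>>(d_data + (size_t)i * n, n);

    // Wait for all streams to finish, then report any error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int i = 0; i < nloops; i++)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    return 0;
}
```

In this sketch each launch already uses many blocks, so I do not know whether the kernels in different streams would actually overlap or simply run one after another.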
Finally, when I run the application, the GPU load shown in GPU-Z is low (about 10%). Can streams make it higher?
I also use the CULA library in my GPU function (along with some kernels I wrote myself). Can I use streams in this situation? (I have seen someone on the CULA forum say that CULA functions already use streams internally, and that using streams may reduce performance.)
These are the questions that have me confused. I hope you can give me some suggestions.
By the way, if there are any good references about streams, please let me know. (I can’t understand it all from the programming guide.)
Thanks in advance.