CUDA streams and concurrency

Hello everyone!

I have started working with CUDA streams and keep getting very puzzling results. I’m testing the basic asynchronous transfer-and-overlap examples (both can be found here: https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/) and running them on a GeForce GTX 760 (compute capability 3.0).
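For context, my setup around the loops below follows the blog post; roughly like this (the kernel body and the sizes here are just placeholders on my side, not necessarily the exact values I use):

__global__ void kernel(float *a, int offset)
{
    // placeholder work, roughly as in the blog post
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    float x = (float)i;
    a[i] = a[i] + sqrtf(sinf(x) * sinf(x) + cosf(x) * cosf(x));
}

// inside main():
const int blockSize = 256, nStreams = 4;
const int n = 4 * 1024 * blockSize * nStreams;
const int streamSize = n / nStreams;
const int streamBytes = streamSize * sizeof(float);

float *a, *d_a;
checkCuda(cudaMallocHost(&a, n * sizeof(float)));   // pinned host memory, required for truly async copies
checkCuda(cudaMalloc(&d_a, n * sizeof(float)));

cudaStream_t stream[nStreams];
for (int i = 0; i < nStreams; ++i)
    checkCuda(cudaStreamCreate(&stream[i]));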

According to the official documentation, the best overlap should come from a loop of this form:
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    checkCuda(cudaMemcpyAsync(&d_a[offset], &a[offset],
                              streamBytes, cudaMemcpyHostToDevice,
                              stream[i]));
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
    checkCuda(cudaMemcpyAsync(&a[offset], &d_a[offset],
                              streamBytes, cudaMemcpyDeviceToHost,
                              stream[i]));
}
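I time the whole loop with CUDA events, as in the blog post, so any overlap should show up as a lower total time than the sequential version (sketch):

cudaEvent_t startEvent, stopEvent;
float ms;
checkCuda(cudaEventCreate(&startEvent));
checkCuda(cudaEventCreate(&stopEvent));

checkCuda(cudaEventRecord(startEvent, 0));
// ... the loop above ...
checkCuda(cudaEventRecord(stopEvent, 0));
checkCuda(cudaEventSynchronize(stopEvent));
checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));
printf("Time for asynchronous transfer and execute (ms): %f\n", ms);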

However, I’m not getting ANY overlap at all!

Moreover, I have tested the same code on a GeForce GTX 970, and there the best overlap comes from the batched loops:

for (int i = 0; i < nStreams; ++i)
{
    int offset = i * streamSize;
    checkCuda(cudaMemcpyAsync(&d_a[offset], &a[offset],
                              streamBytes, cudaMemcpyHostToDevice,
                              stream[i]));
}
for (int i = 0; i < nStreams; ++i)
{
    int offset = i * streamSize;
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}
for (int i = 0; i < nStreams; ++i)
{
    int offset = i * streamSize;
    checkCuda(cudaMemcpyAsync(&a[offset], &d_a[offset],
                              streamBytes, cudaMemcpyDeviceToHost,
                              stream[i]));
}

I’m not sure how to reconcile this divergence from the official recommendations.
Does anybody know what the reason for it could be?

Many thanks for your time and help!

Sofya

sidenote: there is a CODE tag (last icon in the bar above the edit box)

I haven’t thoroughly checked your results, but Hyper-Q was added in CC 3.5, so the GTX 760 (CC 3.0) behaves pretty much the same as the old Fermi devices in this respect.
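If you want to double-check what the card reports, something like this (just a quick sketch) prints the compute capability and the number of copy engines:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: compute capability %d.%d, %d copy engine(s)\n",
           prop.name, prop.major, prop.minor, prop.asyncEngineCount);
    // Hyper-Q (multiple hardware work queues) only exists on CC 3.5 and newer,
    // so a CC 3.0 card has a single work queue and work issued to different
    // streams can serialize depending on issue order
    return 0;
}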