CUDA streams and concurrency

Hello everyone!

I have started working with CUDA streams and keep getting very puzzling results. I’m testing the basic asynchronous transfer-and-overlap examples (both can be found here: https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/) and running them on a GeForce GTX 760 (compute capability 3.0).
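For context, my setup around the loops below follows the blog post; roughly like this (the kernel body and the sizes here are just placeholders on my side, not necessarily the exact values I use):

__global__ void kernel(float *a, int offset)
{
    // placeholder work, roughly as in the blog post
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    float x = (float)i;
    a[i] = a[i] + sqrtf(sinf(x) * sinf(x) + cosf(x) * cosf(x));
}

// inside main():
const int blockSize = 256, nStreams = 4;
const int n = 4 * 1024 * blockSize * nStreams;
const int streamSize = n / nStreams;
const int streamBytes = streamSize * sizeof(float);

float *a, *d_a;
checkCuda(cudaMallocHost(&a, n * sizeof(float)));   // pinned host memory, required for truly async copies
checkCuda(cudaMalloc(&d_a, n * sizeof(float)));

cudaStream_t stream[nStreams];
for (int i = 0; i < nStreams; ++i)
    checkCuda(cudaStreamCreate(&stream[i]));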

According to the official documentation, the best overlap should come from a loop of this form:
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    checkCuda(cudaMemcpyAsync(&d_a[offset], &a[offset],
                              streamBytes, cudaMemcpyHostToDevice,
                              stream[i]));
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
    checkCuda(cudaMemcpyAsync(&a[offset], &d_a[offset],
                              streamBytes, cudaMemcpyDeviceToHost,
                              stream[i]));
}
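I time the whole loop with CUDA events, as in the blog post, so any overlap should show up as a lower total time than the sequential version (sketch):

cudaEvent_t startEvent, stopEvent;
float ms;
checkCuda(cudaEventCreate(&startEvent));
checkCuda(cudaEventCreate(&stopEvent));

checkCuda(cudaEventRecord(startEvent, 0));
// ... the loop above ...
checkCuda(cudaEventRecord(stopEvent, 0));
checkCuda(cudaEventSynchronize(stopEvent));
checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));
printf("Time for asynchronous transfer and execute (ms): %f\n", ms);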

However, I’m not getting ANY overlap at all!

Moreover, I have tested the same code on a GeForce GTX 970, and there the best overlap comes from the batched loops:

for (int i = 0; i < nStreams; ++i)
{
    int offset = i * streamSize;
    checkCuda(cudaMemcpyAsync(&d_a[offset], &a[offset],
                              streamBytes, cudaMemcpyHostToDevice,
                              stream[i]));
}
for (int i = 0; i < nStreams; ++i)
{
    int offset = i * streamSize;
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}
for (int i = 0; i < nStreams; ++i)
{
    int offset = i * streamSize;
    checkCuda(cudaMemcpyAsync(&a[offset], &d_a[offset],
                              streamBytes, cudaMemcpyDeviceToHost,
                              stream[i]));
}

I’m not sure how to reconcile this divergence from the official recommendations.
Does anybody know what the reason for it could be?

Many thanks for your time and help!

Sofya

sidenote: there is a CODE tag (last icon in the bar above the edit box)

I haven’t thoroughly checked your results, but Hyper-Q was added in CC 3.5, so the GTX 760 (CC 3.0) behaves pretty much the same as the old Fermi devices in this respect.
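If you want to double-check what the card reports, something like this (just a quick sketch) prints the compute capability and the number of copy engines:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: compute capability %d.%d, %d copy engine(s)\n",
           prop.name, prop.major, prop.minor, prop.asyncEngineCount);
    // Hyper-Q (multiple hardware work queues) only exists on CC 3.5 and newer,
    // so a CC 3.0 card has a single work queue and work issued to different
    // streams can serialize depending on issue order
    return 0;
}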