Hello everyone!
I have started working with CUDA streams and keep getting very puzzling results. I’m testing the basic examples of asynchronous data transfers (both can be found here: https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/) and running them on a GeForce GTX 760 (compute capability 3.0).
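For reference, my setup follows the blog post: pinned host memory (required for cudaMemcpyAsync to actually overlap with kernels) and one stream per chunk. This is just a rough sketch, with illustrative sizes, where checkCuda is the error-checking helper from the blog post:

const int blockSize = 256;
const int nStreams = 4;
const int n = 4 * 1024 * blockSize * nStreams;       // total number of floats (illustrative)
const int streamSize = n / nStreams;                 // elements handled per stream
const int streamBytes = streamSize * sizeof(float);  // bytes handled per stream

float *a, *d_a;
checkCuda(cudaMallocHost((void**)&a, n * sizeof(float)));  // pinned host memory
checkCuda(cudaMalloc((void**)&d_a, n * sizeof(float)));    // device memory

cudaStream_t stream[nStreams];
for (int i = 0; i < nStreams; ++i)
    checkCuda(cudaStreamCreate(&stream[i]));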
The official documentation says that the best overlap is achieved with a loop of the following form:
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    checkCuda(cudaMemcpyAsync(&d_a[offset], &a[offset],
                              streamBytes, cudaMemcpyHostToDevice,
                              stream[i]));
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
    checkCuda(cudaMemcpyAsync(&a[offset], &d_a[offset],
                              streamBytes, cudaMemcpyDeviceToHost,
                              stream[i]));
}
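To judge the overlap, I compare the timing of this version against the sequential baseline using CUDA events, roughly as in the blog post (a sketch; the variable names here are mine):

cudaEvent_t startEvent, stopEvent;
float ms;
checkCuda(cudaEventCreate(&startEvent));
checkCuda(cudaEventCreate(&stopEvent));

checkCuda(cudaEventRecord(startEvent, 0));
// ... the asynchronous loop above goes here ...
checkCuda(cudaEventRecord(stopEvent, 0));
checkCuda(cudaEventSynchronize(stopEvent));
checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));
printf("Time for asynchronous transfer and execute (ms): %f\n", ms);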
However, I’m not getting ANY overlap at all!
Moreover, I have tested the code on a GeForce GTX 970, and there the best overlap comes from the batched version:
for (int i = 0; i < nStreams; ++i)
{
    int offset = i * streamSize;
    checkCuda(cudaMemcpyAsync(&d_a[offset], &a[offset],
                              streamBytes, cudaMemcpyHostToDevice,
                              stream[i]));
}
for (int i = 0; i < nStreams; ++i)
{
    int offset = i * streamSize;
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}
for (int i = 0; i < nStreams; ++i)
{
    int offset = i * streamSize;
    checkCuda(cudaMemcpyAsync(&a[offset], &d_a[offset],
                              streamBytes, cudaMemcpyDeviceToHost,
                              stream[i]));
}
I’m not sure how to reconcile such a divergence with the official recommendations.
Does anybody know what the reason for it could be?
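If it helps to narrow this down, here is a small check (my own addition, not part of the blog post code) that I can run on both cards to see how many copy engines they report, since that is the kind of hardware difference I suspect might matter:

cudaDeviceProp prop;
checkCuda(cudaGetDeviceProperties(&prop, 0));
printf("%s: compute capability %d.%d, asyncEngineCount = %d\n",
       prop.name, prop.major, prop.minor, prop.asyncEngineCount);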
Many thanks for your time and help!
Sofya