Concurrent execution problem Try to understand how to achieve the data and execution concurrency

I try to experience with concurrent data transfer and computation on Fermi.

This is a very simple test, extracted from the NVIDIA CUDA Programming guide,

[codebox]#include <cutil_inline.h>

global void MyKernel(float* d_o,float* d_i, int size)


int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (idx < size)


    d_o[idx] = d_i[idx] + 1;



int main(int argc, char** argv)


const int nS = 4;

cudaStream_t stream[nS];

int size = 65000 * 512;

for (int i = 0; i < nS; ++i)


float* hostPtr;

cudaMallocHost(&hostPtr, nS * size * sizeof(float));

float* inputDevPtr;

float* outputDevPtr;

cudaMalloc((void**)&inputDevPtr, nS * size * sizeof(float));

cudaMalloc((void**)&outputDevPtr, nS * size * sizeof(float));

for (int i = 0; i < nS; ++i)

    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size * sizeof(float), cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < nS; ++i)

    MyKernel<<<65000, 512, 0, stream[i]>>> (outputDevPtr + i * size, inputDevPtr + i * size, size);

for (int i = 0; i < nS; ++i)

    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size * sizeof(float), cudaMemcpyDeviceToHost, stream[i]);



cutilExit(argc, argv);



I use GPU TIme width plot to display the execution of the kernel.

As we can see only the first stream kernel overlap with second stream data copy from Host to Device. And there are no overlap between other kernel with memory copy from Host to Device and Device To Host.

I can not understand why it happen. Could someone shed a light on the problem ?

And then my question is :

    [*] How to achieve the concurrency, what is the requirement that the code should obey.

    [*] How to manage CUDA scheduler, it is seems obvious to me that the kernel on the second stream could execute right a way after its input data available, while the scheduler forces its execution delay until all data available

    [*] Is the GTX 480 incapable of overlapping Host to Device and Device To Host memory copy. It would be great since it almost doubles speed and increases PCIe bus usage.

Thank you for your help

Linh Ha, sorry to be of no help with your problem, but how did you get the image? Did you use any standard tool?

No problem :) I got it from cudaprof

It turns out that cudaprof is incapable of displaying stream execution since its probes when inserting to the stream break the regular stream execution.

Is this a bug of cudaprof. What else I can use to monitor stream execution.

And it is also sad that even the Fermi 480 is incapable of overlapping between H2D and D2H data transfer.

As stated in the 3.1 Programming Guide (, p. 38), GTX 480 should be able to overlap transfers when using pinned mem and one h2d-memcpy + one d2h-memcpy. Looking at your img I think there should be overlapping h2d/d2h. I would be very sad if this doesnt work cause I got a GTX 480 too…