Streams not overlapping

Hello, I have code very similar to the following:

int k, no_streams = 4;
cudaStream_t stream[no_streams];

for (k = 0; k < no_streams; k++)
  cudaStreamCreate(&stream[k]);

// One device buffer per direction, split into one chunk per stream
cudaMalloc(&g_in,  size1*no_streams);
cudaMalloc(&g_out, size2*no_streams);

// Stage 1: queue all host-to-device copies
for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(g_in+k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);

// Stage 2: queue all kernel launches
for (k = 0; k < no_streams; k++)
  mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in+k*size1/sizeof(float), g_out+k*size2/sizeof(float));

// Stage 3: queue all device-to-host copies
for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(h_ptr_out[k], g_out+k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);

// Wait for all streams to finish before freeing
cudaThreadSynchronize();

cudaFree(g_in);
cudaFree(g_out);

‘h_ptr_in’ and ‘h_ptr_out’ are arrays of pointers to host buffers allocated with cudaMallocHost (with no flags).
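For anyone reading along, here is a rough sketch of how per-stream pinned buffers like these might be allocated. The helper names `alloc_pinned_buffers` and `free_pinned_buffers` are made up for illustration and are not from the code above; the only point is that every buffer passed to cudaMemcpyAsync comes from cudaMallocHost, since asynchronous copies need page-locked host memory to overlap with anything.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical helper: allocate `count` pinned host buffers of `bytes` each.
 * Returns 0 on success, -1 on failure. On failure, any buffers that were
 * already allocated are released again. */
static int alloc_pinned_buffers(float *ptrs[], int count, size_t bytes)
{
    for (int k = 0; k < count; k++) {
        if (cudaMallocHost((void **)&ptrs[k], bytes) != cudaSuccess) {
            while (--k >= 0)
                cudaFreeHost(ptrs[k]);
            return -1;
        }
    }
    return 0;
}

/* Hypothetical helper: release buffers obtained from alloc_pinned_buffers. */
static void free_pinned_buffers(float *ptrs[], int count)
{
    for (int k = 0; k < count; k++)
        cudaFreeHost(ptrs[k]);
}
```

With the names from the question this would be called as `alloc_pinned_buffers(h_ptr_in, no_streams, size1)` and `alloc_pinned_buffers(h_ptr_out, no_streams, size2)`, assuming both arrays hold `float *`.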

The problem is that the streams do not overlap.

In the Visual Profiler I can see the kernel execution from the first stream overlapping with the copy (H2D) from the second stream, but nothing else overlaps.

I may not have the resources to run 2 kernels concurrently (I think I do), but at least the kernel execution and the copies should be overlapping, right?

And if I put all 3 (copy H2D, kernel execution, copy D2H) within the same for loop, none of them overlap…

Please HELP, what can be causing this?

I’m running on:

Ubuntu 10.04 x64
Device 0: “GeForce GTX 460”
  CUDA Driver Version: 3.20
  CUDA Runtime Version: 3.20
  CUDA Capability Major/Minor version number: 2.1
  Concurrent copy and execution: Yes
  Concurrent kernel execution: Yes
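The capabilities listed above can also be read from code. Below is a minimal sketch, assuming a CUDA 4.0 or later toolkit (the `asyncEngineCount` field is not available in 3.2); `print_overlap_caps` is a made-up name. An `asyncEngineCount` of 1 means a copy in one direction can overlap with a kernel, while 2 means copies in both directions can overlap with a kernel and with each other. With a value of 1, an H2D copy and a D2H copy will not overlap each other no matter how the streams are set up.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

/* Hypothetical helper: print the overlap-related capabilities of a device.
 * Returns 0 on success, -1 if the properties could not be read. */
static int print_overlap_caps(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;

    printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    /* 0: no copy/kernel overlap, 1: one copy direction, 2: both directions */
    printf("  asyncEngineCount:  %d\n", prop.asyncEngineCount);
    /* nonzero if kernels from different streams may run at the same time */
    printf("  concurrentKernels: %d\n", prop.concurrentKernels);
    return 0;
}
```

Calling `print_overlap_caps(0)` from `main` reports the first device.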

Kernel launches are blocking when profiling or debugging, so the overlap may simply not show up in the profiler.
(With CUDA 4, I did see an overlap in Computeprof.)

If you want to test overlapping, try timing your calls with cudaEvents outside the profiler and compare the total time with the sum of the partial times. If the total is clearly smaller than the sum, the operations overlapped.
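Here is a rough sketch of that timing comparison, not a definitive benchmark. Everything in it is an assumption made for illustration: `busy_kernel` is a placeholder for mykernel, `N_ELEMS` is an arbitrary chunk size, and the helper names are invented. It uses `cudaDeviceSynchronize`, which needs CUDA 4.0 or later (on 3.2, `cudaThreadSynchronize` plays the same role). Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

#define N_STREAMS 4
#define N_ELEMS   (1 << 20)  /* floats per chunk; an assumed size */

/* Hypothetical stand-in for mykernel: a little arithmetic per element. */
__global__ void busy_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];
        for (int j = 0; j < 200; j++) v = v * 1.0001f + 0.5f;
        out[i] = v;
    }
}

/* Queue H2D copy, kernel and D2H copy for chunk k on stream s. */
static void enqueue_chunk(int k, cudaStream_t s, float *d_in, float *d_out,
                          float **h_in, float **h_out)
{
    size_t bytes = N_ELEMS * sizeof(float);
    float *in  = d_in  + (size_t)k * N_ELEMS;
    float *out = d_out + (size_t)k * N_ELEMS;
    cudaMemcpyAsync(in, h_in[k], bytes, cudaMemcpyHostToDevice, s);
    busy_kernel<<<(N_ELEMS + 255) / 256, 256, 0, s>>>(in, out, N_ELEMS);
    cudaMemcpyAsync(h_out[k], out, bytes, cudaMemcpyDeviceToHost, s);
}

/* Sum of the partial times minus the total time, in milliseconds. */
static float overlap_gain_ms(float total_ms, const float *partial_ms, int n)
{
    float sum = 0.0f;
    for (int k = 0; k < n; k++) sum += partial_ms[k];
    return sum - total_ms;
}

/* Times each chunk alone, then all chunks in flight together, and returns
 * the difference. A clearly positive result suggests overlap. */
static float measure_overlap_gain_ms(void)
{
    size_t bytes = N_ELEMS * sizeof(float);
    float *d_in, *d_out, *h_in[N_STREAMS], *h_out[N_STREAMS];
    float partial[N_STREAMS], total;
    cudaStream_t stream[N_STREAMS];
    cudaEvent_t start, stop;
    int k;

    cudaMalloc((void **)&d_in,  bytes * N_STREAMS);
    cudaMalloc((void **)&d_out, bytes * N_STREAMS);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (k = 0; k < N_STREAMS; k++) {
        cudaStreamCreate(&stream[k]);
        cudaMallocHost((void **)&h_in[k],  bytes);  /* pinned host memory */
        cudaMallocHost((void **)&h_out[k], bytes);
        memset(h_in[k], 0, bytes);
    }

    /* Warm-up so one-time start-up costs do not land in the first timing. */
    enqueue_chunk(0, stream[0], d_in, d_out, h_in, h_out);
    cudaDeviceSynchronize();

    /* Partial times: one chunk at a time, nothing else in flight. */
    for (k = 0; k < N_STREAMS; k++) {
        cudaEventRecord(start, stream[k]);
        enqueue_chunk(k, stream[k], d_in, d_out, h_in, h_out);
        cudaEventRecord(stop, stream[k]);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&partial[k], start, stop);
    }

    /* Total time: all chunks queued before waiting. The events go into the
     * legacy default stream (0), which waits for the streams made with
     * cudaStreamCreate. Reorder the calls here to match your own issue order. */
    cudaEventRecord(start, 0);
    for (k = 0; k < N_STREAMS; k++)
        enqueue_chunk(k, stream[k], d_in, d_out, h_in, h_out);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&total, start, stop);

    for (k = 0; k < N_STREAMS; k++) {
        cudaFreeHost(h_in[k]);
        cudaFreeHost(h_out[k]);
        cudaStreamDestroy(stream[k]);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return overlap_gain_ms(total, partial, N_STREAMS);
}
```

Printing the result of `measure_overlap_gain_ms()` from `main` is enough to try it. Run it outside the profiler and the debugger, since those can serialize the launches. A result near zero suggests the work ran back to back.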