Hello, I have something very similar to the following code:
int k, no_streams = 4;
cudaStream_t stream[no_streams];
for (k = 0; k < no_streams; k++)
    cudaStreamCreate(&stream[k]);
cudaMalloc(&g_in, size1*no_streams);
cudaMalloc(&g_out, size2*no_streams);
for (k = 0; k < no_streams; k++)
    cudaMemcpyAsync(g_in+k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);
for (k = 0; k < no_streams; k++)
    mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in+k*size1/sizeof(float), g_out+k*size2/sizeof(float));
for (k = 0; k < no_streams; k++)
    cudaMemcpyAsync(h_ptr_out[k], g_out+k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);
cudaThreadSynchronize();
cudaFree(g_in);
cudaFree(g_out);
‘h_ptr_in’ and ‘h_ptr_out’ are arrays of pointers allocated with cudaMallocHost (with no flags).
The problem is that the streams do not overlap.
In the Visual Profiler I can see the kernel execution from the first stream overlapping with the copy (H2D) from the second stream, but nothing else overlaps.
I may not have the resources to run 2 kernels concurrently (I think I do), but at least the kernel execution and the copies should be overlapping, right?
And if I put all 3 (copy H2D, kernel execution, copy D2H) within the same for-loop, none of them overlap…
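For reference, the single-loop variant can be sketched as follows. This is a minimal, self-contained illustration rather than the actual code: the kernel body (doubling each element), its extra `n` parameter, the equal input and output sizes, the `CHECK` macro, and the function name `run_single_loop` are all assumptions made for the example. It also waits on each stream with cudaStreamSynchronize instead of calling cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: bail out of the enclosing function on any CUDA error.
// (A sketch only: resources are not released on the error path.)
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical stand-in for mykernel: writes 2 * in[i] to out[i].
__global__ void mykernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Issues H2D copy, kernel, and D2H copy for each stream inside one loop.
// Returns the number of wrong output elements, or -1 on a CUDA error.
int run_single_loop(int no_streams, int n)
{
    const size_t bytes = (size_t)n * sizeof(float);  // assumed: size1 == size2
    std::vector<cudaStream_t> stream(no_streams);
    std::vector<float *> h_ptr_in(no_streams), h_ptr_out(no_streams);
    float *g_in = NULL, *g_out = NULL;

    CHECK(cudaMalloc((void **)&g_in, bytes * no_streams));
    CHECK(cudaMalloc((void **)&g_out, bytes * no_streams));
    for (int k = 0; k < no_streams; k++) {
        CHECK(cudaStreamCreate(&stream[k]));
        // Page-locked host buffers, allocated with cudaMallocHost (no flags).
        CHECK(cudaMallocHost((void **)&h_ptr_in[k], bytes));
        CHECK(cudaMallocHost((void **)&h_ptr_out[k], bytes));
        for (int i = 0; i < n; i++) h_ptr_in[k][i] = (float)(k + i);
    }

    dim3 dimBlock(256);
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);

    // Single loop: all three operations of stream k are queued back to back.
    for (int k = 0; k < no_streams; k++) {
        float *d_in = g_in + (size_t)k * n;
        float *d_out = g_out + (size_t)k * n;
        CHECK(cudaMemcpyAsync(d_in, h_ptr_in[k], bytes,
                              cudaMemcpyHostToDevice, stream[k]));
        mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(d_in, d_out, n);
        CHECK(cudaGetLastError());  // catches launch-configuration errors
        CHECK(cudaMemcpyAsync(h_ptr_out[k], d_out, bytes,
                              cudaMemcpyDeviceToHost, stream[k]));
    }

    // Wait for each stream, verify its results, then release its resources.
    int mismatches = 0;
    for (int k = 0; k < no_streams; k++) {
        CHECK(cudaStreamSynchronize(stream[k]));
        for (int i = 0; i < n; i++)
            if (h_ptr_out[k][i] != 2.0f * h_ptr_in[k][i]) mismatches++;
        CHECK(cudaFreeHost(h_ptr_in[k]));
        CHECK(cudaFreeHost(h_ptr_out[k]));
        CHECK(cudaStreamDestroy(stream[k]));
    }
    CHECK(cudaFree(g_in));
    CHECK(cudaFree(g_out));
    return mismatches;
}
```

Calling `run_single_loop(4, 1 << 20)` from a `main` function and building with nvcc should give a timeline in the profiler that is comparable to the one described here. The sketch only reproduces the order in which the operations are issued; it does not by itself change whether they overlap.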
Please HELP, what can be causing this?
I’m running on:
Ubuntu 10.04 x64
Device 0: “GeForce GTX 460”
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.1
Concurrent copy and execution: Yes
Concurrent kernel execution: Yes