streams not overlapping

pmcr · May 19, 2011, 4:05pm

Hello, I have code very similar to the following:

int k, no_streams = 4;
cudaStream_t stream[no_streams];

for (k = 0; k < no_streams; k++)
  cudaStreamCreate(&stream[k]);

cudaMalloc(&g_in,  size1*no_streams);
cudaMalloc(&g_out, size2*no_streams);

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(g_in+k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);

for (k = 0; k < no_streams; k++)
  mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in+k*size1/sizeof(float), g_out+k*size2/sizeof(float));

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(h_ptr_out[k], g_out+k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);

cudaThreadSynchronize();

cudaFree(g_in);
cudaFree(g_out);

‘h_ptr_in’ and ‘h_ptr_out’ are arrays of pointers allocated with cudaMallocHost (with no flags).
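To make the host-side setup concrete, here is a minimal sketch of one way such pointer arrays could be allocated. It assumes each per-stream buffer (not just the array of pointers) is page-locked, since cudaMemcpyAsync can only overlap with other work when the host buffer is pinned. The buffer sizes are made-up placeholders, not values from the post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int no_streams = 4;
    const size_t size1 = 1 << 20;  // assumed bytes per input chunk
    const size_t size2 = 1 << 20;  // assumed bytes per output chunk

    // The arrays of pointers themselves can live in ordinary host memory;
    // what matters for async copies is that each buffer they point to is pinned.
    float *h_ptr_in[no_streams];
    float *h_ptr_out[no_streams];

    for (int k = 0; k < no_streams; k++) {
        cudaError_t e1 = cudaMallocHost((void **)&h_ptr_in[k], size1);
        cudaError_t e2 = cudaMallocHost((void **)&h_ptr_out[k], size2);
        if (e1 != cudaSuccess || e2 != cudaSuccess) {
            fprintf(stderr, "pinned allocation failed for chunk %d\n", k);
            return 1;
        }
    }

    // ... issue cudaMemcpyAsync / kernel launches on the streams here ...

    for (int k = 0; k < no_streams; k++) {
        cudaFreeHost(h_ptr_in[k]);
        cudaFreeHost(h_ptr_out[k]);
    }
    return 0;
}
```

If the buffers were instead allocated with malloc, the asynchronous copies would still run but would not be expected to overlap with kernels.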

The problem is that the streams do not overlap.

In the visual profiler I can see the kernel execution from the first stream overlapping with the copy (H2D) from the second stream but nothing else overlaps.

I may not have the resources to run two kernels concurrently (I think I do), but at least the kernel execution and the copies should be overlapping, right?

And if I put all 3 (copy H2D, kernel execution, copy D2H) within the same for-loop none of them overlap…

Please HELP, what can be causing this?

I’m running on:

.Ubuntu 10.04 x64

.Device 0: “GeForce GTX 460”

CUDA Driver Version: 3.20

CUDA Runtime Version: 3.20

CUDA Capability Major/Minor version number: 2.1

Concurrent copy and execution: Yes

Concurrent kernel execution: Yes

fcs · May 23, 2011, 1:11pm

Kernel launches are blocking when profiling or debugging, so the profiler can hide overlap that happens in a normal run (with CUDA 4, I did see an overlap in Computeprof).

If you want to test overlapping, try timing your calls with cudaEvents and compare the total time with the sum of the partial times.
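A rough sketch of that timing test follows. It is not the original poster's code: the kernel `scale`, the helper `issueChunk`, and the chunk size are hypothetical stand-ins, and it assumes the legacy default stream (stream 0), which waits for work in the other streams. Pass 1 runs one chunk at a time to get the partial times; pass 2 queues all chunks at once and measures the total.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's mykernel.
__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Queue copy-in, kernel, and copy-out for chunk k on stream s.
static void issueChunk(int k, cudaStream_t s, float *h_in, float *h_out,
                       float *d_in, float *d_out, int n)
{
    size_t bytes = n * sizeof(float);
    size_t off = (size_t)k * n;
    cudaMemcpyAsync(d_in + off, h_in + off, bytes, cudaMemcpyHostToDevice, s);
    scale<<<(n + 255) / 256, 256, 0, s>>>(d_in + off, d_out + off, n);
    cudaMemcpyAsync(h_out + off, d_out + off, bytes, cudaMemcpyDeviceToHost, s);
}

int main()
{
    const int nStreams = 4;
    const int n = 1 << 22;  // elements per chunk (assumed size)
    const size_t total = (size_t)n * nStreams;

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost((void **)&h_in, total * sizeof(float));   // pinned host memory
    cudaMallocHost((void **)&h_out, total * sizeof(float));
    cudaMalloc((void **)&d_in, total * sizeof(float));
    cudaMalloc((void **)&d_out, total * sizeof(float));
    for (size_t i = 0; i < total; i++) h_in[i] = 1.0f;

    cudaStream_t stream[nStreams];
    for (int k = 0; k < nStreams; k++) cudaStreamCreate(&stream[k]);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    // Warm-up so one-time launch overhead does not inflate pass 1.
    issueChunk(0, stream[0], h_in, h_out, d_in, d_out, n);
    cudaStreamSynchronize(stream[0]);

    // Pass 1: one chunk at a time, waiting for each before issuing the next.
    float sumPartial = 0.0f;
    for (int k = 0; k < nStreams; k++) {
        cudaEventRecord(t0, stream[k]);
        issueChunk(k, stream[k], h_in, h_out, d_in, d_out, n);
        cudaEventRecord(t1, stream[k]);
        cudaEventSynchronize(t1);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        sumPartial += ms;
    }

    // Pass 2: queue every chunk without waiting, then time the whole batch.
    // t1 is recorded in stream 0, so it completes after all streams finish.
    float totalMs = 0.0f;
    cudaEventRecord(t0, 0);
    for (int k = 0; k < nStreams; k++)
        issueChunk(k, stream[k], h_in, h_out, d_in, d_out, n);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&totalMs, t0, t1);

    printf("sum of partial times: %.3f ms\n", sumPartial);
    printf("total (all streams):  %.3f ms\n", totalMs);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    for (int k = 0; k < nStreams; k++) cudaStreamDestroy(stream[k]);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```

If the total comes out clearly below the sum of the partial times when run outside the profiler, some overlap is happening; if the two are about equal, the work is effectively serialized.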
