Problems with streams: very strange behavior!

Hi all!

I wanted to reduce the latency of my algorithm by using streams, but the results I get from the CUDA profiler are strange. It looks as if the kernel and the memcpy functions are executed sequentially. I hope you understand what I got. These are the values the profiler shows BEFORE using streams:

Method     | Timestamp | GPU Time | CPU Time | Occupancy | Stream Id | Grid Size | Block Size | SM CTA Launched | CTA Launched
memcpyHtoD |         0 |   110058 |    4.795 |           |           |           |            |                 |
Kernel     |    110071 |  4001.22 |   113280 |         1 |         1 |        72 |        256 |               3 |            8
memcpyDtoH |    114243 |  87783.4 |    5.159 |           |           |           |            |                 |

And these are the values WITH three streams:

Method     | Timestamp | GPU Time | CPU Time | Occupancy | Stream Id | Grid Size | Block Size | SM CTA Launched | CTA Launched
memcpyHtoD |         0 |    36684 |    4.795 |           |           |           |            |                 |
memcpyHtoD |     36685 |    36739 |    4.795 |           |           |           |            |                 |
Kernel     |    110172 |     2207 |   111600 |         1 |         1 |        24 |        256 |               1 |            3
Kernel     |    112602 |     2201 |     2216 |         1 |         2 |        24 |        256 |               1 |            3
Kernel     |    114994 |     2195 |     2239 |         1 |         3 |        24 |        256 |               1 |            2

(The last three rows are the kernel launches; I mislabeled them as memcpyHtoD when copying the profiler output.)




As you can see, there is no improvement in latency. I know the kernel execution time is too short to produce really significant differences (there is not much time to overlap with the memcpys, but this is only a test), but I should still see small differences, shouldn't I?
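To rule out the profiler itself, I also measure the wall-clock time of the whole batch with CUDA events. This is just a sketch of the timing harness (the variable names are mine, not from the code below); if the copies and kernels really overlap, the elapsed span from first copy to last kernel should shrink, even if the individual calls look unchanged:

```cpp
// Sketch: bracket the whole batch of async copies and kernel launches
// with two events in the default stream; cudaEventElapsedTime then
// reports the total span in milliseconds.
cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);

cudaEventRecord(evStart, 0);
// ... issue all cudaMemcpyAsync calls and kernel launches here ...
cudaEventRecord(evStop, 0);
cudaEventSynchronize(evStop);

float fMs = 0.0f;
cudaEventElapsedTime(&fMs, evStart, evStop);

cudaEventDestroy(evStart);
cudaEventDestroy(evStop);
```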

Also, the memcpy functions don't show a stream id. Is this right? The profiler manual says the stream id is usually shown for every function of a stream, so for the memcpy functions too.

What did I do wrong?

I'm using a GeForce GTX 260 (compute capability 1.3) with SDK 2.3 and Toolkit 2.3. For the stream implementation I use hardly more code than the example in the programming guide. Here are some parts (the forum swallowed my braces and loop bodies, so I have restored them):


//Creating streams
cudaStream_t *Stream = NULL;
Stream = (cudaStream_t*)malloc(uiNumOfStreams * sizeof(cudaStream_t));
for (int i = 0; i < uiNumOfStreams; ++i)
  cudaStreamCreate(&Stream[i]);


//Memcpy for both directions
for (int i = 0; i < uiNumOfStreams; ++i)
{
  unsigned int uiOffset = i * uiWidth * uiHeight * m_uiImagesPerStream;
  if (uiDirection == cudaMemcpyHostToDevice)
  {
    Destination = CudaBase::GetDeviceMemoryPtr().idtInputDevData + uiOffset;
    Source = CudaBase::sm_HostPtr + uiOffset;
  }
  else
  {
    Source = CudaBase::GetDeviceMemoryPtr().idtOutputDevData + uiOffset;
    Destination = CudaBase::sm_HostPtr + uiOffset;
  }
  cudaMemcpyAsync(Destination, Source, GetCopySize(i, uiNumberOfImages, uiWidth, uiHeight), (cudaMemcpyKind)uiDirection, Stream[i]);
}


//Kernel execution
for (int i = 0; i < uiNumOfStreams; ++i)
{
  unsigned int uiOffset = i * uiWidth * uiHeight * uiImagesPerStream;
  if (uiNumberOfImages - uiImagesPerStream * i >= uiImagesPerStream)
    uiImgPerCurrStream = uiImagesPerStream;
  else
    uiImgPerCurrStream = uiNumberOfImages - uiImagesPerStream * i;
  KernelXY<<<(uiWidth / NUMOFTHREADS) * uiImgPerCurrStream, NUMOFTHREADS, 0, Stream[i]>>>(idtInputDevData + uiOffset, idtOutputDevData + uiOffset, uiWidth, uiHeight);
}


//destroy streams
for (int i = 0; i < uiNumOfStreams; ++i)
  cudaStreamDestroy(Stream[i]);
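For completeness, here is a minimal, self-contained version of the pattern I am trying to use (a simplified sketch with a dummy kernel and made-up sizes, not my production code). The issue order is breadth-first, as in my real code: all host-to-device copies first, then all kernels, then all device-to-host copies, each chunk in its own stream:

```cpp
#include <cuda_runtime.h>

__global__ void DummyKernel(float* pIn, float* pOut, unsigned int uiN)
{
    unsigned int uiIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (uiIdx < uiN)
        pOut[uiIdx] = pIn[uiIdx] * 2.0f;
}

int main(void)
{
    const unsigned int uiNumOfStreams = 3;
    const unsigned int uiChunk = 1 << 20;               // elements per stream
    const unsigned int uiTotal = uiNumOfStreams * uiChunk;

    float *pHost, *pDevIn, *pDevOut;
    cudaMallocHost((void**)&pHost, uiTotal * sizeof(float)); // page-locked
    cudaMalloc((void**)&pDevIn,  uiTotal * sizeof(float));
    cudaMalloc((void**)&pDevOut, uiTotal * sizeof(float));

    cudaStream_t Stream[uiNumOfStreams];
    for (unsigned int i = 0; i < uiNumOfStreams; ++i)
        cudaStreamCreate(&Stream[i]);

    // All HtoD copies first, each chunk in its own stream ...
    for (unsigned int i = 0; i < uiNumOfStreams; ++i)
        cudaMemcpyAsync(pDevIn + i * uiChunk, pHost + i * uiChunk,
                        uiChunk * sizeof(float),
                        cudaMemcpyHostToDevice, Stream[i]);

    // ... then all kernels ...
    for (unsigned int i = 0; i < uiNumOfStreams; ++i)
        DummyKernel<<<uiChunk / 256, 256, 0, Stream[i]>>>(
            pDevIn + i * uiChunk, pDevOut + i * uiChunk, uiChunk);

    // ... then all DtoH copies.
    for (unsigned int i = 0; i < uiNumOfStreams; ++i)
        cudaMemcpyAsync(pHost + i * uiChunk, pDevOut + i * uiChunk,
                        uiChunk * sizeof(float),
                        cudaMemcpyDeviceToHost, Stream[i]);

    cudaThreadSynchronize(); // Toolkit 2.3 API for a full device sync

    for (unsigned int i = 0; i < uiNumOfStreams; ++i)
        cudaStreamDestroy(Stream[i]);
    cudaFreeHost(pHost);
    cudaFree(pDevIn);
    cudaFree(pDevOut);
    return 0;
}
```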



Some additional questions:

  1. Is it possible to perform data transfers between host and device in both directions concurrently?

  2. What is the meaning of the profiler column "cta launched"? As far as I know, a CTA is a thread block, but how do the values above come about? How can cta launched be 3 if the grid size is 24? And how can the number of executed blocks differ between two identical kernels (here the values are 2 and 3)?

I hope you can help me!

Best regards!

Edit: Naturally I use page-locked memory. And I never call any CUDA function on stream 0 in between the other commands.

Hi, I'm having the same problem. Asynchronous memcpys all show up in stream 0 in the profiler, although I pass them a valid non-zero stream argument. I haven't yet verified whether they are able to overlap with computation in other streams or not.

Did you find out if this is just a profiler bug, or if it’s something to worry about?