No Performance Improvement from Overlapping Kernel/Memcpy

I have a kernel that operates on a large array of elements, one after another. After using an element, that element can be discarded. My goal is to overlap the execution of the kernel with the transfer of the data from host to device.

Here is the approach I am using:

  1. Divide the data set into B blocks.

  2. Allocate memory in the device for 2 blocks: blockA and blockB.

  3. Send the first block to the device in blockA.

  4. Asynchronously start a kernel on blockA while copying the second block from the host into blockB.

(continue populating one block while using the other)

This should allow me to overlap kernel execution with all data movement except for the first block. Here is the pseudocode:

//reserve space for two blocks in device memory
cudaMalloc((void**)&blockA, size);
cudaMalloc((void**)&blockB, size);

for(int i = -1; i < blockCount; i++){
   //copy block i+1 while the kernel for block i runs
   if(i < blockCount - 1) cudaMemcpyAsync( (i+1)%2==0 ? blockA : blockB, ... , stream[i+1]);
   if(i > -1) doKernel<<< ... , stream[i]>>>( i%2==0 ? blockA : blockB );
   cudaThreadSynchronize(); //needed for correctness, but kills the overlap
}

(I know the indexing is a little weird. It basically just pairs the memcpy for block i with the kernel launch for block i-1.)

Here’s the problem: when I run the code, I get no performance improvement. The memcpy and the kernel do not overlap.

If I move the cudaThreadSynchronize() outside the loop, I get the expected performance improvement (the overlap happens), but the results are incorrect. I have to ensure that each kernel completes before I move on to the next kernel/memcpy pair.

Note: I do not have enough memory available in the device to hold all the blocks, so I can’t replace blockA and blockB with an array block[0…blockCount].

Any suggestions?

Do you initialize the streams with cudaStreamCreate() before using them?

Yes I do.

Then this is really strange, as the streams should already provide the synchronization needed for correctness. Do the kernels depend on anything other than their respective blockA or blockB?

Actually, this is not strange. Try using only 2 streams, one for blockA and one for blockB.

I don’t see how that is true. Each stream contains a memcpy to a block and a kernel that uses that same block. I see how a stream guarantees that its kernel will execute only after its memcpy is complete. But since streams can execute concurrently, what stops stream[2] from writing to blockA before stream[0] is done with it? That’s why I have the cudaThreadSynchronize inside the loop.
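For the record, the only alternative I can see to the full sync would be cross-stream events. A rough sketch I have not tried, assuming cudaStreamWaitEvent is available in my CUDA version (it was added in 3.2); the done[] events are illustrative and the copy/launch arguments are elided as in my pseudocode above:

```cuda
//Untested sketch: order buffer reuse across streams with events
//instead of a full cudaThreadSynchronize().
cudaEvent_t done[2];
cudaEventCreate(&done[0]);
cudaEventCreate(&done[1]);

for(int i = -1; i < blockCount; i++){
    if(i < blockCount - 1){
        //buffer (i+1)%2 was last touched by the kernel for block i-1;
        //make the copy stream wait for that kernel to finish
        if(i >= 1) cudaStreamWaitEvent(stream[i+1], done[(i+1)%2], 0);
        cudaMemcpyAsync( (i+1)%2==0 ? blockA : blockB, ... , stream[i+1]);
    }
    if(i > -1){
        doKernel<<< ... , stream[i]>>>( i%2==0 ? blockA : blockB );
        cudaEventRecord(done[i%2], stream[i]);
    }
}
```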

Seems like that will work. Will try it and post back in a minute.

OK, I have it using only 2 streams, with the cudaThreadSynchronize outside the loop. (See modified pseudocode below.) You were right - this did produce the correct result. In fact, I think one could show that this is functionally equivalent to my original implementation.

Unfortunately, the performance is also equivalent. No improvement over blocking memcpys.

//reserve space for two blocks in device memory
cudaMalloc((void**)&blockA, size);
cudaMalloc((void**)&blockB, size);

for(int i = -1; i < blockCount; i++){
   if(i < blockCount - 1) cudaMemcpyAsync( (i+1)%2==0 ? blockA : blockB, ... , stream[(i+1)%2]);
   if(i > -1) doKernel<<< ... , stream[i%2]>>>( i%2==0 ? blockA : blockB );
}
cudaThreadSynchronize(); //now outside the loop

Enqueue all of your memcpys, then enqueue all of your launches.

Does your kernel take the same time to execute on each invocation?

OK I am doing this now.

There is one thing that I left out originally. Not sure if this matters but…

The cudaMemcpyAsync is not happening as one big chunk; I wrote it that way in the pseudocode for clarity. The cudaMemcpyAsync call that populates a block is actually about a hundred separate calls. I am aware that sending data from host to device in small pieces lowers bandwidth, which is part of the reason why hiding that communication is so important. Does that make a difference?

Yes, that makes it far more likely to block as you fill up certain resources.

Yes it does.

*sigh* Is this documented somewhere, so I can see what resources I am using up and what the limits are?

Sorry for not including this in the OP.

No, it’s hardware- and driver-dependent. There is a fixed number of calls that can be enqueued at any given time; as you enqueue more work, you will eventually have to block and wait for some of it to drain.

Thank you both for your time and help. This is going to take a lot of work to fix. I’ll post back when I get it working, to confirm for the record that queue length was the problem.

The problem is resolved. I modified my code to pack everything into one contiguous block of memory on the host side so it could be transferred in a single memcpy. Thank you again for helping me figure this out.

To future Googlers: here is the TL;DR

There is a limit to the number of asynchronous calls that can be in the queue at any one time. The limit varies with device and driver, and may be undocumented; for the Fermi-based C2050 with default drivers, it is less than 100. If you have a large number of memcpys to do, the solution (or workaround) is to pack the data into a buffer on the host, then send the buffer to the device in one big memcpy.
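A minimal sketch of the workaround (h_staging, pieces, pieceLen, pieceCount, and blockBytes are made-up names for illustration):

```cuda
//pack the ~100 small pieces into one pinned host buffer,
//then issue a single cudaMemcpyAsync per block
float *h_staging, *d_block;
cudaMallocHost((void**)&h_staging, blockBytes); //pinned, required for true async copies
cudaMalloc((void**)&d_block, blockBytes);

size_t offset = 0;
for(int p = 0; p < pieceCount; p++){
    //packing on the host is cheap compared to ~100 PCIe transactions
    memcpy(h_staging + offset, pieces[p], pieceLen[p] * sizeof(float));
    offset += pieceLen[p];
}

//one big transfer: one queue slot, and much better PCIe bandwidth
cudaMemcpyAsync(d_block, h_staging, blockBytes, cudaMemcpyHostToDevice, stream);
```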