No Performance Improvement from Overlapping Kernel/Memcpy

I have a kernel that operates on a large array of elements, one after another. After using an element, that element can be discarded. My goal is to overlap the execution of the kernel with the transfer of the data from host to device.

Here is the approach I am using:

  1. Divide the data set into blockCount blocks.

  2. Allocate memory in the device for 2 blocks: blockA and blockB.

  3. Send the first block to the device in blockA.

  4. Asynchronously start a kernel on blockA while copying the second block from the host into blockB.

(continue populating one block while using the other)

This should allow me to overlap execution with all data movement except for the first block. Here is the pseudocode:

//reserve space for each of the two device buffers (double buffering)
cudaMalloc((void **)&blockA, size);
cudaMalloc((void **)&blockB, size);

for (int i = -1; i < blockCount; i++) {
   //copy block i+1 while the kernel works on block i
   if (i < blockCount - 1) cudaMemcpyAsync((i + 1) % 2 == 0 ? blockA : blockB, ..., stream[i + 1]);
   if (i > -1) doKernel<<<..., stream[i]>>>(i % 2 == 0 ? blockA : blockB);
   cudaThreadSynchronize();
}

(I know the indexing is a little weird. It just pairs the memcpy for block i with the kernel launch for block i-1.)

Here’s the problem: when I run this code, I see no performance improvement. The memcpy and the kernel do not overlap.

If I move the cudaThreadSynchronize() outside the loop, I get the expected performance improvement (the overlap happens), but the result is incorrect. I have to ensure that each kernel completes before the next memcpy overwrites its buffer.

Note: I do not have enough memory available in the device to hold all the blocks, so I can’t replace blockA and blockB with an array block[0…blockCount].

Any suggestions?

Do you initialize the streams with cudaStreamCreate() before using them?

Yes I do.
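For reference, the setup looks roughly like this (a sketch assuming one stream per block, as in the pseudocode above):

//one stream per block
cudaStream_t *stream = (cudaStream_t *)malloc(blockCount * sizeof(cudaStream_t));
for (int i = 0; i < blockCount; i++)
    cudaStreamCreate(&stream[i]);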

Then this is really strange, as the streams should already provide the synchronization needed for correctness. Do the kernels depend on anything other than their respective blockA or blockB?

Actually, this is not strange. Try using only 2 streams, one for blockA and one for blockB.

I don’t see how that is true. Each stream contains a memcpy to a block and a kernel that uses the same block. I see how a stream guarantees that its kernel will execute only after its memcpy is complete. But since streams can execute concurrently, what stops stream[2] from writing to blockA before stream[0] is done with it? That’s why I have the cudaThreadSynchronize() inside the loop.

Seems like that will work. Will try it and post back in a minute.

OK, I have it using only 2 streams, with the cudaThreadSynchronize() outside the loop. (See the modified pseudocode below.) You were right: this produces the correct result. In fact, I think one could show that it is functionally equivalent to my original implementation.

Unfortunately, the performance is also equivalent. No improvement over blocking memcpys.

//reserve space for each of the two device buffers (double buffering)
cudaMalloc((void **)&blockA, size);
cudaMalloc((void **)&blockB, size);

for (int i = -1; i < blockCount; i++) {
   //copy block i+1 while the kernel works on block i; buffer and stream share parity
   if (i < blockCount - 1) cudaMemcpyAsync((i + 1) % 2 == 0 ? blockA : blockB, ..., stream[(i + 1) % 2]);
   if (i > -1) doKernel<<<..., stream[i % 2]>>>(i % 2 == 0 ? blockA : blockB);
}

cudaThreadSynchronize();

Enqueue all of your memcpys, then enqueue all of your launches.

Does your kernel take the same time to execute on each invocation?

OK I am doing this now.

There is one thing that I left out originally. Not sure if this matters but…

The cudaMemcpyAsync is not happening as one big chunk; I wrote it that way in the pseudocode for clarity. The cudaMemcpyAsync call that populates a block is actually about a hundred separate calls. I am aware that sending data from host to device in small pieces lowers bandwidth, which is part of the reason masking that communication is so important. Does that fact make a difference?
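To illustrate (names like devBlock, pieceOffset, and hostPiece are hypothetical, just to show the shape of it), each "one" memcpy in the pseudocode is really something like:

//~100 small async copies per block instead of one large one
//devBlock is a char* so the offsets are in bytes
for (int p = 0; p < pieceCount; p++)
    cudaMemcpyAsync(devBlock + pieceOffset[p], hostPiece[p], pieceSize[p],
                    cudaMemcpyHostToDevice, stream[i % 2]);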

Yes, that makes it far more likely to block, as you fill up certain resources.

Yes it does.

Sigh. Is this documented somewhere, so I can see which resources I am using up and what the limits are?

Sorry for not including this in the OP.

No, it’s hardware- and driver-dependent. You have a fixed number of calls that can be enqueued at any given time; as you enqueue more work, you will eventually have to block and wait for some of it to drain.
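One way to see it for yourself (a hypothetical diagnostic, since the limit is undocumented): time each "asynchronous" call on the host. A call that merely enqueues returns almost immediately; one that takes milliseconds is blocking because the queue is full.

#include <sys/time.h>
#include <stdio.h>

//wall-clock time in seconds
static double seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

//...then, inside the copy loop:
double t0 = seconds();
cudaMemcpyAsync(devBlock + pieceOffset[p], hostPiece[p], pieceSize[p],
                cudaMemcpyHostToDevice, stream[i % 2]);
if (seconds() - t0 > 1e-3) //an enqueue should be near-instant
    printf("cudaMemcpyAsync blocked on piece %d\n", p);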

Thank you both for your time and help. This is going to take a lot of work to fix. I’ll post back when I get it working to confirm, for the record, that queue length was the problem.

The problem is resolved. I modified my code to pack everything into one contiguous block of memory on the host side so it could be transferred in a single memcpy. Thank you again for helping me figure this out.

To future Googlers: here is the TL;DR

There is a limit to the number of asynchronous calls that can be in the queue at any one time. That limit varies by device and driver, and may be undocumented. For the Fermi C2050 with default drivers, the limit is less than 100. If you have a large number of memcpys to do, the solution (or workaround) is to pack the data into a single buffer on the host, then send that buffer to the device in one big memcpy, as sketched below.
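A minimal sketch of that workaround (names are illustrative; the staging buffer must be pinned with cudaMallocHost for the copy to actually be asynchronous):

//pinned staging buffer on the host, packed once per block
char *staging;
cudaMallocHost((void **)&staging, size);

for (int p = 0, offset = 0; p < pieceCount; p++) {
    //pack the ~100 small pieces contiguously with cheap host-side copies
    memcpy(staging + offset, hostPiece[p], pieceSize[p]);
    offset += pieceSize[p];
}

//one large async transfer instead of ~100 small ones
cudaMemcpyAsync(devBlock, staging, size, cudaMemcpyHostToDevice, stream[i % 2]);

In practice the staging area should be double-buffered as well, so you don’t repack it while a transfer from it is still in flight.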