Processing Order with CUDA Streams in CUDA 7.5

I am working with a Maxwell board, compute capability 5.2.
Trying to maximize throughput on this board, I have naively set up a few kernels; kernel1 takes about 5X longer than kernel2 to complete, and either kernel on its own consumes all GPU resources while it is running. I have three streams, and each one does the same thing: an asynchronous PCIe transfer in, kernel1, kernel2, and an asynchronous transfer out. Here is the general idea:

for (int streamIdx = 0; streamIdx < 3; streamIdx++)
{
    cudaMemcpyAsync(deviceInPtr[streamIdx], hostInPtr[streamIdx], sizeof(float) * dataLength, cudaMemcpyHostToDevice, stream[streamIdx]);
    kernel1<<<numBlocks, threadsPerBlock, 0, stream[streamIdx]>>>(deviceInPtr[streamIdx], deviceInterimPtr[streamIdx], ...);
    kernel2<<<numBlocks, threadsPerBlock, 0, stream[streamIdx]>>>(deviceInterimPtr[streamIdx], deviceOutPtr[streamIdx], ...);
    cudaMemcpyAsync(hostOutPtr[streamIdx], deviceOutPtr[streamIdx], sizeof(float) * dataLength, cudaMemcpyDeviceToHost, stream[streamIdx]);
}

What I want to know is: why does the scheduler run kernel1 for all three streams consecutively and then run kernel2 consecutively for all three streams? Why don't the kernel calls execute in the order in which they were queued? If I could get the scheduler to process kernel1 and then kernel2 for each stream, the return traffic from the GPU would be distributed more evenly over time and there would actually be some semblance of concurrent transfers and kernel processing.
If anyone has any ideas other than adding in a bunch of slow synchronization points, I would be very happy to hear them.
Thanks!


Because when you use streams, any concept of "the order in which they were queued" goes out the window once we are talking about CUDA operations issued to independent streams.

The CUDA processing model makes no guarantee about the order of execution of operations issued to independent streams.

The scheduler is entitled to make any decision it wishes about execution order for operations issued to independent streams. Such decisions may change from GPU to GPU, CUDA version to CUDA version, or run to run.

That is the nature of streams: asynchronous execution. If the scheduler enforced a depth-first order, someone who wanted breadth-first would complain. And if the scheduler enforced breadth-first, someone who wanted depth-first would complain.

Instead, the scheduler seeks to maximize performance within the constraints you give it.


Thanks, txbob.
I think that without a timeline there is a little misunderstanding here. While I agree with you that the scheduler is optimizing GPU performance, this setup does not manage the PCIe I/O efficiently.
Here is what currently happens:

|===H2D===||====K1====| |K2||===D2H===||===H2D===||====K1====| etc
|===H2D===| |====K1====| |K2||===D2H===||===H2D===| |====K1====| etc
|===H2D===| |====K1====| |K2||===D2H===||===H2D===| etc

As you can see, there is a bunch of time in there where no I/O is occurring, and then all the output transfers compete. There is actually some overlap of the D2H and H2D transfers, which the depiction above does not show. Alternatively, I would like:
|===H2D===||====K1====||K2||===D2H===||===H2D===| |====K1====||K2| etc
|===H2D===| |====K1====||K2||===D2H===||===H2D===| |====K1====| etc
|===H2D===| |====K1====||K2||===D2H===||===H2D===| etc

I believe that the second version of the timeline gives me more throughput.

I see that my spaces were optimized out, so the diagrams didn't show what I wanted.

Basically, because the K2 time is short, the output transfers all happen essentially at once and compete, extending the timeline. If we think about keeping the PCIe bus busy, then we should process depth-first.

You can avoid having the spaces “optimized out” by marking the section of text as a code block.

As it currently stands:

|===H2D===||====K1====|                        |K2||===D2H===||===H2D===||====K1====| etc
 |===H2D===|           |====K1====|                |K2||===D2H===||===H2D===|        |====K1====| etc
  |===H2D===|                      |====K1====|        |K2||===D2H===||===H2D===| etc

So there is a lot of time where no I/O is happening, and instead it’s all bunched at the end.
The way I would like it:

|===H2D===||====K1====||K2||===D2H===||===H2D===|          |====K1====||K2| etc
 |===H2D===|               |====K1====||K2||===D2H===||===H2D===|          |====K1====||K2| etc
  |===H2D===|                              |====K1====||K2||===D2H===||===H2D===|            etc

Thus throughput is optimized.

Take a look at this blog:

I cannot see the rest of your code but did you allocate the host memory via cudaHostAlloc() in order to pin the memory?

Generally I have had good success using streams and multiple GPUs with nice overlap, but some tweaking may be necessary.
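
If not, something along these lines usually does the trick (just a sketch; buffer names and sizes are placeholders, not taken from your code):

// Sketch: pinned (page-locked) host buffers so cudaMemcpyAsync can run truly
// asynchronously. dataLength and the pointer names are only examples.
const size_t dataLength = 1 << 20;
const size_t bytes = sizeof(float) * dataLength;
float *hostInPtr[3], *hostOutPtr[3];

for (int i = 0; i < 3; i++)
{
    // cudaHostAlloc with cudaHostAllocDefault behaves like cudaMallocHost here
    cudaHostAlloc((void **)&hostInPtr[i],  bytes, cudaHostAllocDefault);
    cudaHostAlloc((void **)&hostOutPtr[i], bytes, cudaHostAllocDefault);
}

// ... use the buffers with cudaMemcpyAsync ...

for (int i = 0; i < 3; i++)
{
    // pinned memory must be released with cudaFreeHost, not free()
    cudaFreeHost(hostInPtr[i]);
    cudaFreeHost(hostOutPtr[i]);
}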

Yes, it is all pinned.

I did take a look at the blog, mfatica. I have tried different orderings and groupings of these calls with very little change in behavior.
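
For reference, one of the reorderings I tried was the breadth-first issue pattern from that blog, roughly like this (same variables as my snippet above):

// Breadth-first issue: each class of operation is issued across all streams
// before moving on to the next class.
for (int i = 0; i < 3; i++)
    cudaMemcpyAsync(deviceInPtr[i], hostInPtr[i], sizeof(float) * dataLength, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 3; i++)
    kernel1<<<numBlocks, threadsPerBlock, 0, stream[i]>>>(deviceInPtr[i], deviceInterimPtr[i], ...);
for (int i = 0; i < 3; i++)
    kernel2<<<numBlocks, threadsPerBlock, 0, stream[i]>>>(deviceInterimPtr[i], deviceOutPtr[i], ...);
for (int i = 0; i < 3; i++)
    cudaMemcpyAsync(hostOutPtr[i], deviceOutPtr[i], sizeof(float) * dataLength, cudaMemcpyDeviceToHost, stream[i]);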

When you execute asynchronous CUDA commands without specifying a stream, the runtime uses the default stream. Before CUDA 7, the default stream is a special stream which implicitly synchronizes with all other streams on the device.
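
In other words (a minimal sketch with hypothetical kernel names, not from my code), anything launched without a stream argument lands in the default stream and, in legacy mode, acts as an implicit synchronization point:

kernelA<<<numBlocks, threadsPerBlock, 0, stream[0]>>>(devPtrA);   // explicit stream
kernelB<<<numBlocks, threadsPerBlock>>>(devPtrB);                 // default stream: waits for, and blocks, the other streams (legacy behavior)
kernelC<<<numBlocks, threadsPerBlock, 0, stream[1]>>>(devPtrC);   // cannot start until kernelB finishes
// With CUDA 7+ the per-thread default stream (nvcc --default-stream per-thread)
// removes this implicit synchronization.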
Thanks
Qwikfix

I think this particular case is pretty obvious. Each kernel invocation consists of a number of thread blocks, and the GPU starts new thread blocks as previously running blocks finish execution and free up SM resources. The GPU usually schedules blocks from the already running kernel with higher priority, so once the first invocation of K1 has started, the GPU keeps launching new K1 thread blocks until they are exhausted.

At that point the finished K1 thread blocks continue to release SM resources, but the GPU can no longer start new K1 blocks. Nor can it start K2 blocks, since K2 may begin only after ALL K1 thread blocks have finished and the entire K1 kernel is therefore considered complete. So it starts scheduling blocks from another, independent kernel.

The entire streams machinery was developed exactly for this purpose: to overlap execution of kernels, which is impossible within a single stream. This reduces the tail effect and makes better use of GPU resources. If you want to run the kernels in a different order, you can send them down a single stream and use async memcpy, probably fired by kernel completion events.
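
A minimal sketch of that last idea (stream, event, and pointer names are placeholders):

// Keep both kernels in one stream so each buffer is processed depth-first,
// and let a separate copy stream wait on an event recorded after kernel2.
cudaStream_t computeStream, copyOutStream;
cudaEvent_t  kernel2Done;
cudaStreamCreate(&computeStream);
cudaStreamCreate(&copyOutStream);
cudaEventCreateWithFlags(&kernel2Done, cudaEventDisableTiming);

kernel1<<<numBlocks, threadsPerBlock, 0, computeStream>>>(deviceInPtr, deviceInterimPtr, ...);
kernel2<<<numBlocks, threadsPerBlock, 0, computeStream>>>(deviceInterimPtr, deviceOutPtr, ...);
cudaEventRecord(kernel2Done, computeStream);

// The D2H transfer is issued in its own stream but starts only after kernel2,
// so it can overlap the next buffer's kernels in computeStream.
cudaStreamWaitEvent(copyOutStream, kernel2Done, 0);
cudaMemcpyAsync(hostOutPtr, deviceOutPtr, sizeof(float) * dataLength, cudaMemcpyDeviceToHost, copyOutStream);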

Another way to partially get the desired behavior is cudaStreamCreateWithPriority.
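
For example (a sketch; the actual priority range depends on the device):

// Lower numbers mean higher priority; query the supported range first.
int leastPriority, greatestPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

cudaStream_t highPrioStream, lowPrioStream;
cudaStreamCreateWithPriority(&highPrioStream, cudaStreamNonBlocking, greatestPriority);
cudaStreamCreateWithPriority(&lowPrioStream,  cudaStreamNonBlocking, leastPriority);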


Thank you, BulatZiganshin. I was able to get the single stream working the way I wanted by using an event at the end of the second kernel. The PCIe input and output traffic is handled by separate streams, so the transfers execute concurrently with the kernels.
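
Roughly, the arrangement ended up looking like this (a sketch with placeholder names, not my exact code):

// One stream for H2D copies, one for the kernels, one for D2H copies,
// chained together with events so each buffer flows depth-first.
cudaStream_t copyInStream, computeStream, copyOutStream;
cudaStreamCreate(&copyInStream);
cudaStreamCreate(&computeStream);
cudaStreamCreate(&copyOutStream);

cudaEvent_t inReady[3], outReady[3];
for (int i = 0; i < 3; i++)
{
    cudaEventCreateWithFlags(&inReady[i],  cudaEventDisableTiming);
    cudaEventCreateWithFlags(&outReady[i], cudaEventDisableTiming);
}

for (int i = 0; i < 3; i++)
{
    // Upload in its own stream, then mark completion.
    cudaMemcpyAsync(deviceInPtr[i], hostInPtr[i], sizeof(float) * dataLength, cudaMemcpyHostToDevice, copyInStream);
    cudaEventRecord(inReady[i], copyInStream);

    // Kernels stay in a single stream, so each buffer runs K1 then K2 back to back,
    // but only after its upload has finished.
    cudaStreamWaitEvent(computeStream, inReady[i], 0);
    kernel1<<<numBlocks, threadsPerBlock, 0, computeStream>>>(deviceInPtr[i], deviceInterimPtr[i], ...);
    kernel2<<<numBlocks, threadsPerBlock, 0, computeStream>>>(deviceInterimPtr[i], deviceOutPtr[i], ...);
    cudaEventRecord(outReady[i], computeStream);

    // Download in a third stream as soon as this buffer's kernel2 is done,
    // overlapping with the next buffer's upload and kernels.
    cudaStreamWaitEvent(copyOutStream, outReady[i], 0);
    cudaMemcpyAsync(hostOutPtr[i], deviceOutPtr[i], sizeof(float) * dataLength, cudaMemcpyDeviceToHost, copyOutStream);
}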