Processing Order with CUDA Streams in 7.5

I am working with a Maxwell board, compute capability 5.2.
Trying to maximize throughput on this board, I have naively set up two kernels; kernel1 takes about 5X longer than kernel2 to complete. Either kernel consumes all GPU resources while it is running. I have three streams. Each one does the same thing: an asynchronous PCIe transfer in, kernel1, kernel2, and an asynchronous transfer out. Here is the general idea:

for (streamIdx = 0; streamIdx < 3; streamIdx++) {
     cudaMemcpyAsync(deviceInPtr[streamIdx], hostInPtr[streamIdx], sizeof(float)*dataLength, cudaMemcpyHostToDevice, stream[streamIdx]);
     kernel1<<<numBlocks, threadsPerBlock, 0, stream[streamIdx]>>>(deviceInPtr[streamIdx], deviceInterimPtr[streamIdx], ...);
     kernel2<<<numBlocks, threadsPerBlock, 0, stream[streamIdx]>>>(deviceInterimPtr[streamIdx], deviceOutPtr[streamIdx], ...);
     cudaMemcpyAsync(hostOutPtr[streamIdx], deviceOutPtr[streamIdx], sizeof(float)*dataLength, cudaMemcpyDeviceToHost, stream[streamIdx]);
}

What I want to know is why the scheduler performs kernel1 for all three streams consecutively and then moves on to perform kernel2 consecutively for all three streams. Why don’t the kernel calls execute in the order they were queued? If I could get the scheduler to process kernel1 then kernel2 for each stream, the return traffic from the GPU would be more evenly distributed over time and there would actually be some semblance of concurrent transfers and kernel processing.
If anyone has any ideas other than adding in a bunch of slow synchronization points, I would be very happy to hear them.


Because when you use streams, any concept of “the order in which they were queued” goes out the window for CUDA operations issued to independent streams.

The CUDA processing model makes no guarantee about the order of execution of operations issued to independent streams.

The scheduler is entitled to make any decision it wishes about execution order for operations issued to independent streams. Such decisions may change from GPU to GPU, CUDA version to CUDA version, or run to run.

That is the nature of streams: asynchronous execution. If the scheduler enforced a depth-first order, someone who wanted breadth-first would complain. And if the scheduler enforced breadth-first, someone who wanted depth-first would complain.

Instead, the scheduler seeks to maximize performance within the constraints you give it.
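For what it's worth, if a specific cross-stream order really is required, it can be imposed with events rather than device-wide synchronization points. A minimal sketch under assumptions (the function wrapper, kernel signatures, and event names are illustrative, not from the original code):

```cuda
#include <cuda_runtime.h>

// Assumed from the original post: kernel1/kernel2 and the buffers
// already exist; the real kernel parameter lists are not shown.
extern __global__ void kernel1(const float *in, float *interim);
extern __global__ void kernel2(const float *interim, float *out);

void launch_depth_first(float *hostInPtr[3], float *hostOutPtr[3],
                        float *deviceInPtr[3], float *deviceInterimPtr[3],
                        float *deviceOutPtr[3], size_t dataLength,
                        cudaStream_t stream[3],
                        dim3 numBlocks, dim3 threadsPerBlock)
{
    cudaEvent_t done[3];
    for (int i = 0; i < 3; i++)
        cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);

    for (int i = 0; i < 3; i++) {
        cudaMemcpyAsync(deviceInPtr[i], hostInPtr[i],
                        sizeof(float) * dataLength,
                        cudaMemcpyHostToDevice, stream[i]);
        // Hold this stream's K1 until the previous stream's K2 has
        // completed, forcing the depth-first order K1,K2 / K1,K2 / ...
        // The H2D and D2H copies can still overlap with compute.
        if (i > 0)
            cudaStreamWaitEvent(stream[i], done[i - 1], 0);
        kernel1<<<numBlocks, threadsPerBlock, 0, stream[i]>>>(
            deviceInPtr[i], deviceInterimPtr[i]);
        kernel2<<<numBlocks, threadsPerBlock, 0, stream[i]>>>(
            deviceInterimPtr[i], deviceOutPtr[i]);
        cudaEventRecord(done[i], stream[i]);
        cudaMemcpyAsync(hostOutPtr[i], deviceOutPtr[i],
                        sizeof(float) * dataLength,
                        cudaMemcpyDeviceToHost, stream[i]);
    }
}
```

`cudaStreamWaitEvent` is a device-side dependency, so the host thread never blocks; it only constrains the order in which the GPU may begin the kernels.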


Thanks, txbob.
I think that without a timeline, there is a little misunderstanding here. While I agree with you that the scheduler is optimizing the GPU performance, the setup does not efficiently manage the PCIe I/O.
Here is what currently happens:

|===H2D===||====K1====| |K2||===D2H===||===H2D===||====K1====| etc
|===H2D===| |====K1====| |K2||===D2H===||===H2D===| |====K1====| etc
|===H2D===| |====K1====| |K2||===D2H===||===H2D===| etc

As you can see, there is a bunch of time in there where no I/O is occurring, and then all the outputs compete. (There is actually some overlap of the D2H and H2D transfers that the depiction above does not show.) Alternatively, I would like
|===H2D===||====K1====||K2||===D2H===||===H2D===| |====K1====||K2| etc
|===H2D===| |====K1====||K2||===D2H===||===H2D===| |====K1====| etc
|===H2D===| |====K1====||K2||===D2H===||===H2D===| etc

I believe that the second version of the timeline gives me more throughput.

I see that my spaces were optimized out, so the timelines don’t show what I wanted.

Basically, because the K2 time is short, the output transfers all happen essentially at once and compete, extending the timeline. If the goal is to keep the PCIe bus busy, then we should process depth-first.

You can avoid having the spaces “optimized out” by marking the section of text as a code block.

As it currently stands:

|===H2D===||====K1====|                        |K2||===D2H===||===H2D===||====K1====| etc
 |===H2D===|           |====K1====|                |K2||===D2H===||===H2D===|        |====K1====| etc
  |===H2D===|                      |====K1====|        |K2||===D2H===||===H2D===| etc

So there is a lot of time where no I/O is happening, and instead it’s all bunched at the end.
The way I would like it:

|===H2D===||====K1====||K2||===D2H===||===H2D===|          |====K1====||K2| etc
 |===H2D===|               |====K1====||K2||===D2H===||===H2D===|          |====K1====||K2| etc
  |===H2D===|                              |====K1====||K2||===D2H===||===H2D===|            etc

Thus throughput is optimized.

Take a look at this blog:

I cannot see the rest of your code but did you allocate the host memory via cudaHostAlloc() in order to pin the memory?
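For reference, `cudaMemcpyAsync` only overlaps with compute when the host buffers are page-locked. A minimal allocation sketch (buffer names and size mirror the post; this is not the poster's actual code):

```cuda
#include <cuda_runtime.h>

// Pinned (page-locked) host allocations: required for cudaMemcpyAsync
// to run truly asynchronously and overlap with kernel execution.
// Pageable host memory silently degrades to staged, synchronous-ish copies.
void alloc_pinned(float **hostInPtr, float **hostOutPtr, size_t dataLength)
{
    cudaHostAlloc((void **)hostInPtr,
                  sizeof(float) * dataLength, cudaHostAllocDefault);
    cudaHostAlloc((void **)hostOutPtr,
                  sizeof(float) * dataLength, cudaHostAllocDefault);
}

void free_pinned(float *hostInPtr, float *hostOutPtr)
{
    // Pinned memory must be released with cudaFreeHost, not free().
    cudaFreeHost(hostInPtr);
    cudaFreeHost(hostOutPtr);
}
```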

Generally I have had nice success using streams and multiple GPUs with good overlap, but some tweaking may be necessary.

Yes, it is all pinned.

I did take a look at the blog, mfatica. I have tried different orderings and groupings of these calls with very little change in behavior.

When you execute asynchronous CUDA commands without specifying a stream, the runtime uses the default stream. Before CUDA 7, the default stream is a special stream which implicitly synchronizes with all other streams on the device.

I think this particular case is pretty obvious. Each kernel invocation consists of a number of thread blocks. The GPU starts new thread blocks as previous thread blocks finish their execution and free up SM resources. The GPU usually schedules blocks from the already-running kernel with higher priority, so once the first invocation of K1 has started, the GPU continues to launch new K1 thread blocks until they are exhausted.

At that point, finishing K1 thread blocks continue to release SM resources, but the GPU can no longer start new K1 thread blocks. Nor can it start K2 blocks, since K2 may begin only after ALL of K1's thread blocks have finished and the entire K1 kernel is considered complete. So it starts to schedule blocks from another, independent kernel.

The entire streams machinery was developed for exactly this purpose - to overlap execution of kernels, which is impossible in a single stream. This reduces the tail effect and makes better use of GPU resources. If you want to run the kernels in a different order, you can send them all to a single stream and use async memcpys, probably fired by kernel-completion events.

Another way to partially get the desired behavior is cudaStreamCreateWithPriority.
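A quick sketch of that approach (stream names here are illustrative): query the device's priority range, then give one stream higher priority so the block scheduler prefers its work when both streams have blocks ready.

```cuda
#include <cuda_runtime.h>

// Sketch: streams with different priorities. When both streams have
// thread blocks ready, the scheduler prefers blocks from the
// higher-priority stream (lower numeric value = higher priority).
void create_prioritized_streams(cudaStream_t *hiPrio, cudaStream_t *loPrio)
{
    int leastPri, greatestPri;
    cudaDeviceGetStreamPriorityRange(&leastPri, &greatestPri);

    cudaStreamCreateWithPriority(hiPrio, cudaStreamNonBlocking, greatestPri);
    cudaStreamCreateWithPriority(loPrio, cudaStreamNonBlocking, leastPri);
}
```

Note this only biases block scheduling; it does not guarantee a strict ordering the way events do.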


Thank you, BulatZiganshin. I was able to get the single-stream approach working the way I wanted, using events at the end of the second kernel. The PCIe input and output traffic is handled by separate streams, so it executes concurrently with the kernels.
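Since the final code isn't shown, here is one way that arrangement might look (a sketch under assumptions; stream, event, and kernel names are illustrative): a single compute stream keeps K1/K2 depth-first by construction, while dedicated H2D and D2H copy streams carry the PCIe traffic, chained with events.

```cuda
#include <cuda_runtime.h>

// Assumed kernel declarations; the real parameter lists are not shown.
extern __global__ void kernel1(const float *in, float *interim);
extern __global__ void kernel2(const float *interim, float *out);

void run_pipeline(float *hostInPtr[3], float *hostOutPtr[3],
                  float *deviceInPtr[3], float *deviceInterimPtr[3],
                  float *deviceOutPtr[3], size_t dataLength,
                  dim3 numBlocks, dim3 threadsPerBlock)
{
    // One compute stream preserves K1,K2 issue order per chunk;
    // separate copy streams let PCIe traffic overlap with compute.
    cudaStream_t copyIn, compute, copyOut;
    cudaStreamCreate(&copyIn);
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copyOut);

    cudaEvent_t inReady[3], outReady[3];
    for (int i = 0; i < 3; i++) {
        cudaEventCreateWithFlags(&inReady[i],  cudaEventDisableTiming);
        cudaEventCreateWithFlags(&outReady[i], cudaEventDisableTiming);
    }

    for (int i = 0; i < 3; i++) {
        cudaMemcpyAsync(deviceInPtr[i], hostInPtr[i],
                        sizeof(float) * dataLength,
                        cudaMemcpyHostToDevice, copyIn);
        cudaEventRecord(inReady[i], copyIn);

        // Compute waits only for this chunk's input transfer.
        cudaStreamWaitEvent(compute, inReady[i], 0);
        kernel1<<<numBlocks, threadsPerBlock, 0, compute>>>(
            deviceInPtr[i], deviceInterimPtr[i]);
        kernel2<<<numBlocks, threadsPerBlock, 0, compute>>>(
            deviceInterimPtr[i], deviceOutPtr[i]);
        cudaEventRecord(outReady[i], compute);

        // The output copy fires as soon as this chunk's K2 finishes,
        // spreading D2H traffic over time instead of bunching it.
        cudaStreamWaitEvent(copyOut, outReady[i], 0);
        cudaMemcpyAsync(hostOutPtr[i], deviceOutPtr[i],
                        sizeof(float) * dataLength,
                        cudaMemcpyDeviceToHost, copyOut);
    }
}
```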