Can I use streaming to overlap kernels and data transfers in this scenario?

Our desktop application uses a 3rd-party high-speed data acquisition PCI card that acquires a stream of data and sends it to the host app on a periodic basis; for the purpose of this example let’s say this is 4Mb every 1ms.
Our app employs a “read loop” that waits for a signal from the PCI card to say that it has written a block to a host “transfer” buffer, which we then cudaMemcpy to a GPU buffer. The read loop repeats this process multiple times, each time appending the new block of acquired data to that GPU buffer. Depending on different factors/settings, it might repeat this 3 or 4 times, or 300-400 times. For the purpose of this example let’s say it does 200 transfers.
Once the required number of blocks have been appended to the GPU buffer, the read loop will then run four processing kernels. These must be run sequentially, as each one “transforms” the acquired data in some way, ready for the next kernel to act on. Once those kernels have finished, a D2H is used to transfer a small amount of “results” data (a few tens of Kb at most) back to the host. The process then repeats, and the entire acquisition might run for just a few seconds or as much as 20 minutes.

I thought streams would make perfect sense here, i.e. run the kernels on one stream while the read loop is going around waiting for, and transferring, the next 200 blocks of data on another stream. For the purpose of this example, let’s say each of the four kernels takes 10ms, so if each of the 200 data blocks arrives at 1ms intervals then I’d expect the kernels to overlap the first ~20% of those data transfers.

Having made the necessary code changes, I’m not seeing any overlapping in NSight. I can confirm that all host buffers are pinned. After further reading up on streams, it seems that you don’t simply enqueue the operations sequentially, as we’re effectively doing, and I’ve seen numerous mentions of “breadth first”, although I’m still struggling to understand some of the concepts, and causes of blocking (like [this one](https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf), from slide 15 onwards).

I’m now thinking I have been naive in my understanding of how streaming works. What I’m still not entirely clear on is whether those two streams are totally independent, or whether each action is like a “slot”, whereby in one slot the GPU will execute an H2D transfer and a kernel, then in the next slot execute the next H2D transfer and kernel, and so on, with the time taken in each slot equal to the longer of the two operations? If this is true then it presumably means the four kernels would overlap with the first four H2D transfers (effectively taking 10ms each), before the remaining 196 data transfers complete as normal?

I’m starting to think that streams aren’t suitable for this application, especially as the read loop has to wait for a signal from the PCI card before it can do each H2D transfer?

The streams are independent; there are no slots.

So you have

A. ~200 transfers, each of size 4MB, each happening at 1ms intervals.
B. A sequence of 4 kernel calls that must be performed on that data described in A. This sequence requires ~40ms.
C. After the kernels, a small D2H transfer is necessary. This should only require on the order of 10’s or 100’s of microseconds.

That should be readily stream-able. If, without streams, the entire processing for that “packet” described in A,B,C above takes 240ms, then you should be able to reduce the processing duration for the packet to ~200ms on average, assuming a continuous stream of packet delivery.

In addition to the webinar you linked, unit 7 of this online training series goes into the concept of packet-ized stream processing of work, in some detail.

For proper design, each of the items A,B,C for a given packet should be issued into the same stream, and the packet work (A,B,C) for separate packets should be issued into separate streams.

Depth-first vs. breadth-first will become clearer when you get a good grip on what the work issuance loop should look like. Depth-first refers to the idea that we issue in the order ABCABCABCABCABC (for 5 packets). Breadth-first refers to the idea that we issue in the order AAAAABBBBBCCCCC (for 5 packets). The ideas are not entirely interchangeable: for example, an unending stream of work may be more amenable to depth-first work issuance than to breadth-first. In your case, depth-first is what makes sense to me as a starting point.
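To make that concrete, a depth-first issuance loop might look roughly like the untested sketch below (every name in it, such as numPackets, the buffer arrays, kernel1..kernel4 and streamForPacket, is an illustrative placeholder, not something from your code):

for (int p = 0; p < numPackets; ++p) {
    cudaStream_t s = streamForPacket(p);   // one stream per packet (a small pool of reused streams also works)
    // A: H2D transfer of this packet's data (pinned host memory, so the copy is truly asynchronous)
    cudaMemcpyAsync(d_in[p], h_pinned[p], packetBytes, cudaMemcpyHostToDevice, s);
    // B: the four dependent kernels, issued back-to-back into the same stream
    kernel1<<<grid, block, 0, s>>>(d_in[p], d_tmp[p]);
    kernel2<<<grid, block, 0, s>>>(d_tmp[p]);
    kernel3<<<grid, block, 0, s>>>(d_tmp[p]);
    kernel4<<<grid, block, 0, s>>>(d_tmp[p], d_out[p]);
    // C: small D2H transfer of the results
    cudaMemcpyAsync(h_results[p], d_out[p], resultBytes, cudaMemcpyDeviceToHost, s);
}
// Breadth-first issuance would instead use three separate loops over p:
// first all the A copies, then all the kernel sequences, then all the C copies.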

I’m not seeing any overlapping in NSight.

The primary overlap you would want to observe in this case is that the kernel-processing for a given packet (i.e. step B) overlaps with the incoming data delivery (step A) for a subsequent packet.

You would need more than 40 streams in your example (packets arriving every 1ms), and even more once transfer time is included, not just 4 streams.

For a first version, I would recommend starting with similar time periods for the steps; you can optimize later on. Use packets of 40Mb (is that MByte or Mbit?) gathered over 10ms, so you have a rhythm of 10ms and only need a handful of streams, say 4 or 5, instead of 40 or 50.

(If the data does not belong together, just process it along a grid dimension, or in a loop within the kernel, or with separate kernel calls in a host loop within the stream.)

Do not forget the effect of the L2 cache. Your sizes are in a range where overlapping could worsen the cache hit rate (depending on the size of the kernel inputs and outputs and the GPU you use).

Do you want to optimize for throughput or average latency?

I’m not sure why. You only need (and would only want) one stream for the processing of a packet. You would use the same stream for all the individual H2D transfers associated with step A of a packet, and that same stream again for the 4 kernels, and that same stream again for the D2H transfer at the end of packet processing. You would then use another different stream for the next packet.

If 40 packets are in flight, then each needs its own stream.

The alternative is to process groups of packets with one stream. This is more or less what I suggested: combine the data over 10ms (instead of 1ms) and process that amount of data with one stream (H2D, 4 kernels, D2H), while the data from the next 10ms is handled concurrently by a different stream. So only about 4 or 5 streams are needed.

Or have I misunderstood?

Where did the number 40 come from? I don’t see that in OP’s post. And if we judge the total number of packets by OP’s statement:

then the total number of packets could be far more than 40.

No, all packets do not need a unique stream (not even practical for a streaming situation or where you have 20 minutes of acquisition). You only need enough streams to enable the necessary concurrency scenarios. A number like 3 or 4 streams is usually sufficient, especially for a simple streaming case like this. Streams can be reused. It’s likely that by issuing an order like this:

Packet 1: Stream 1
Packet 2: Stream 2
Packet 3: Stream 3
Packet 4: Stream 4
Packet 5: Stream 1
Packet 6: Stream 2

and so on, that no further benefit would arise from using more than 4 streams.
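As a rough, untested illustration of that reuse (NUM_STREAMS, numPackets and issuePacketWork are just placeholder names):

const int NUM_STREAMS = 4;
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamCreate(&streams[i]);

for (int p = 0; p < numPackets; ++p) {
    // Packet 1 -> Stream 1, Packet 2 -> Stream 2, ..., Packet 5 -> Stream 1 again, and so on.
    cudaStream_t s = streams[p % NUM_STREAMS];
    issuePacketWork(p, s);   // hypothetical helper that issues A, B and C for packet p into stream s
}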

Hi Robert,

I took it from this quote:

“let’s say each of the four kernels takes 10ms, so if each of the 200 data blocks arrives at 1ms intervals then I’d expect the kernels to overlap the first ~20% of those data transfers”

4 kernels * 10ms = 40ms.

200 * 1ms = 200ms.

20% * 200ms = 40ms.

The big question is: if the 40ms execution time holds for a single packet, why should it be possible to stream and process a new packet every 1ms? Wouldn’t the GPU’s resources already be occupied, so that further parallel streams slow down the 40ms of each packet?

The only class of cases I can imagine is when each packet cannot be parallelized well, or needs to exchange data through shared memory, so that it is implemented as 1 CUDA block per packet (i.e. grid size == 1). Then many packets (at least as many as there are SMs) would be needed to fill all SMs. If the aim is to optimize latency to the last bit, then each packet has to be started as early as possible, i.e. in its own stream, without waiting for other packets to arrive. And each packet would get its own dedicated SM. The maximum number of resident grids allowed on recent architectures is 128.

So at least such cases are possible. I am not sure (not enough information) whether OP’s actual case qualifies.

That said, I was also recommending using fewer packets at the beginning of the optimization.

If a stream processes one packet after another depth-first, and the 4 kernels alone take at least 40ms to execute, and new packets arrive every 1ms, then at least 40 concurrent streams are needed to keep up with the arriving packets, even when reusing streams.

Thanks for the replies so far. I thought it might be useful to clarify what the “read loop” is doing:

while (not received all data packets)
{
    waitForBuffer();                    // wait for acq card to write packet into the host “transfer” buffer (blocking)
    cudaMemcpyAsync(H2D, stream1);      // copy transfer buffer to GPU (appends)
    cudaStreamSynchronize(stream1);
    if (GPU buffer contains required number of packets)
    {
        execute the processing kernels (stream 1)
        receive results (D2H, stream 1)
    }
} // end while
sync both streams, cleanup etc.


waitForBuffer() is a function of the acquisition card API. As mentioned above, this is a blocking call, and in NSight you’ll see the packets arriving ~1ms apart (controlled by an external h/w trigger), while each transfer only takes ~350us, so the read loop is spending a lot of time waiting for the data to arrive. (Again this can vary depending on how the acquisition has been configured: packet size, trigger frequency, etc).

I added the cudaStreamSynchronize after each H2D transfer to ensure the transfers happen “there-and-then”, on the data currently in the “transfer” buffer. Without this, NSight showed the H2D transfers executing tens of ms later (presumably due to the async nature of the streams), by which time the data in the host transfer buffer would no longer be “current”, resulting in the “wrong” data being copied to the GPU.

And just to clarify something that Robert_Crovella touched on with his “AAABBBCCC” suggestion. We have to copy all (say) 200 packets to the GPU buffer before processing that buffer as a single entity (so “AAA…200…AAABC AAA…200…AAABC” and so on). The nature of what we’re doing means we can’t break the problem down by processing each packet individually, which I guess was Robert’s understanding.

Now that things are (hopefully) a little clearer, and knowing that the read loop has to block between each H2D copy, are streams still a viable option here?

So you want to break down the copies into e.g. 200 pieces, but not the 4 kernel calls, which process all the data? And the 4 kernel calls are fast enough compared to the speed at which the data arrives (the kernels are 2.5x faster)?

Yes, the data arrives in 200 pieces/packets (due to the way this is acquired from an external hardware device). Each cudaMemcpyAsync then appends these packets to a GPU buffer, and once all 200 pieces are in there then the kernels are run to process that single set of data.
Using my example, the overall time to transfer those 200 packets will be 200 × 1ms = 200ms, while the total kernel execution time is 40ms, so it feels like there is a “saving” to be made by overlapping the kernel executions with the next 200 data transfers (assuming this is possible).

(A bit more info: when running sequentially (no streams), you might have realised that we can’t receive any further data during that 40ms period when the kernels are executing. Here, the packets get buffered on the PCI acquisition card. Then, when our read loop resumes after the kernels have finished, each call to the “wait…()” API function will retrieve the buffered packets/pieces “instantly” (because they’ve already been acquired and are sitting in the acq card buffer). This means the read loop will very quickly “catch up” by receiving that backlog of buffered packets, at which point it’ll be back to receiving a packet “live” every 1ms.)

This does potentially mean that streams may not benefit after all, as we’re ultimately at the mercy of the 1ms hardware trigger, which is throttling the data throughput. Having said that, my example of having to copy 200 packets before processing them is an “extreme” scenario. As I’ve hinted at previously, various aspects of this application can be configured, e.g. packet size, trigger frequency, and even how many packets are transferred to the GPU before processing (which can be as low as 1). Transferring and processing just one is perhaps a better fit for streaming (ABCABC…), but there are so many variables that I suspect it’ll be difficult to find a “one size fits all” solution, especially regarding streams.

Then use two streams, each one responsible for the transfer, the full processing, and the returning of the results of 200 packets. Put in some synchronization between them, so that the second stream only starts transferring once the first has finished (and vice versa). You can also use host callbacks within the streams to communicate with your acquisition card.
Do not use the default stream (0) for either of those two streams, because it has implicit synchronizations where you do not want them.
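Very roughly, the ping-pong between the two streams could look like the following untested sketch (the buffer names, packetsPerBatch, numBatches and the runFourKernels() helper are illustrative placeholders; waitForBuffer() is your acquisition API):

cudaStream_t streams[2];
cudaEvent_t  xferDone[2];
for (int i = 0; i < 2; ++i) {                  // two explicitly created streams, not the default stream
    cudaStreamCreate(&streams[i]);
    cudaEventCreateWithFlags(&xferDone[i], cudaEventDisableTiming);
    cudaEventRecord(xferDone[i], 0);           // pre-record so the very first batch does not wait
}

for (int batch = 0; batch < numBatches; ++batch) {
    int cur = batch & 1, other = cur ^ 1;
    // Only start this batch's transfers once the other stream has finished *its* transfers;
    // the other stream's kernels can still be running and therefore overlap these copies.
    cudaStreamWaitEvent(streams[cur], xferDone[other], 0);
    for (int p = 0; p < packetsPerBatch; ++p) {
        waitForBuffer();                       // blocking acquisition-card call (host side)
        cudaMemcpyAsync(d_buf[cur] + p * packetBytes, h_transferBuf, packetBytes,
                        cudaMemcpyHostToDevice, streams[cur]);
        cudaStreamSynchronize(streams[cur]);   // the single transfer buffer is about to be overwritten
    }
    cudaEventRecord(xferDone[cur], streams[cur]);
    runFourKernels(d_buf[cur], streams[cur]);  // the four dependent kernels, back-to-back in this stream
    cudaMemcpyAsync(h_results[cur], d_res[cur], resultBytes,
                    cudaMemcpyDeviceToHost, streams[cur]);
}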

Perhaps I’ve been following too many tutorials/examples where they always have one stream for data transfers and one for kernels, to overlap the two types of operation. Your suggestion sounds more reminiscent of OpenCL queues and “double buffering”, so I’ll look into that one.

I think my main concern is that the H2D transfers need to be synchronised with the wait…() API call, to ensure that each one copies the packet that is in the host buffer at that moment in time. By moving to an async/streams-based approach, you presumably lose any guarantee of when the H2D transfer will actually happen; the acquisition card is continually overwriting the host buffer with a new packet (every 1ms), so when the H2D finally happens there’s a good chance it will copy a completely different packet than the one present when cudaMemcpyAsync() was called. In fact this is what I was seeing before I added the “synchronise” seen in my earlier pseudo-code. I suspect a side effect of doing this is that I’ve just reverted the read loop to synchronous operation, undoing any benefits of using streams. I’m starting to think I can’t use streams in this application due to the synchronous nature of how the packets arrive.

The transfer of data from the acquisition device to a host buffer, and the processing of that host buffer (potentially with GPU acceleration), seem to be easily decoupled via double buffering (as used in graphics) or use of a buffer descriptor (BD) ring (as used in networking equipment).
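A minimal single-threaded sketch of such double buffering, assuming the application may memcpy each packet out of the card's transfer buffer into its own pinned staging buffers (cardTransferBuffer, h_staging, d_buf and the rest are illustrative names, not a real API):

#include <cstring>

void uploadWithDoubleBuffering(int numPackets, size_t packetBytes,
                               const char* cardTransferBuffer, char* h_staging[2],
                               char* d_buf, cudaStream_t stream)
{
    cudaEvent_t slotFree[2];
    for (int s = 0; s < 2; ++s)
        cudaEventCreateWithFlags(&slotFree[s], cudaEventDisableTiming);

    for (int p = 0; p < numPackets; ++p) {
        int s = p % 2;
        if (p >= 2)
            cudaEventSynchronize(slotFree[s]);   // the H2D issued from this slot two packets ago is complete
        waitForBuffer();                         // blocking: the card has delivered the next packet
        memcpy(h_staging[s], cardTransferBuffer, packetBytes);   // free the card's buffer immediately
        cudaMemcpyAsync(d_buf + (size_t)p * packetBytes, h_staging[s], packetBytes,
                        cudaMemcpyHostToDevice, stream);          // async copy from the pinned slot
        cudaEventRecord(slotFree[s], stream);    // completes once this slot's upload has finished
    }
}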

Whether GPU acceleration makes sense will depend on the cost of host->device data transfer over a relatively slow PCIe interconnect relative to the cost of buffer content processing. That is a function of the CPU used, the system memory characteristics, the PCIe version and link width available, and the GPU used.

For this you can use cudaStreamAddCallback and put the waitForBuffer() call there. You just have to ensure, with events between the two streams, which stream is currently receiving, so that they do not both try to receive at the same time.
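The callback part could look something like this untested sketch (function and buffer names are placeholders; note that a stream callback must not make CUDA API calls itself, and the event handshake between the two streams would look like the earlier ping-pong sketch):

// Executed by the CUDA runtime in stream order: subsequent work in the stream
// (here, the H2D copy of packet p) does not start until the callback returns.
void CUDART_CB waitForPacket(cudaStream_t stream, cudaError_t status, void* userData)
{
    waitForBuffer();   // blocking acquisition-card call, now sequenced inside the stream
}

void issueBatch(cudaStream_t stream, char* d_buf, const char* h_transferBuf,
                size_t packetBytes, int packetsPerBatch)
{
    for (int p = 0; p < packetsPerBatch; ++p) {
        cudaStreamAddCallback(stream, waitForPacket, nullptr, 0);        // wait for packet p to arrive
        cudaMemcpyAsync(d_buf + (size_t)p * packetBytes, h_transferBuf, packetBytes,
                        cudaMemcpyHostToDevice, stream);                 // then copy it, in stream order
    }
}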