Our desktop application uses a third-party high-speed data acquisition PCI card that acquires a stream of data and sends it to the host app periodically - for the purpose of this example let’s say this is 4 MB every 1 ms.
Our app employs a “read loop” that waits for a signal from the PCI card to say that it has written to a host buffer, which we then cudaMemcpy to a GPU buffer. The read loop repeats this process multiple times, each time appending the new block of acquired data to that GPU buffer. Depending on various factors/settings it might repeat this 3 or 4 times, or as many as 300-400 times. For the purpose of this example let’s say it does 200 transfers.
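Here’s a simplified sketch of that read loop (names like `wait_for_card_signal` and `BLOCK_BYTES` are placeholders, not the card SDK’s real API):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

constexpr size_t BLOCK_BYTES = 4 * 1024 * 1024;  // 4 MB per acquisition block
constexpr int    NUM_BLOCKS  = 200;              // transfers per batch

void wait_for_card_signal();  // placeholder for the card SDK's notification

// Copies NUM_BLOCKS blocks from the (pinned) host buffer into the device
// buffer, appending each block at the next offset.
void read_loop(const char* host_buf, char* d_acq, cudaStream_t stream)
{
    for (int i = 0; i < NUM_BLOCKS; ++i) {
        wait_for_card_signal();  // the card has refilled host_buf
        cudaMemcpyAsync(d_acq + (size_t)i * BLOCK_BYTES, host_buf, BLOCK_BYTES,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);  // host_buf is reused, so wait for the
                                        // copy before the card overwrites it
    }
}
```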
Once the required number of blocks has been appended to the GPU buffer, the read loop then runs four processing kernels. These must run sequentially, as each one “transforms” the acquired data in some way, ready for the next kernel to act on. Once those kernels have finished, a D2H transfer copies a small amount of “results” data (a few tens of KB at most) back to the host. The process then repeats, and the entire acquisition might run for anywhere from a few seconds to 20 minutes.
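The processing phase looks roughly like this (`kernel1`..`kernel4`, the launch configuration and `d_results` are illustrative placeholders):

```cpp
// Placeholder kernel declarations - in reality each kernel "transforms"
// the acquired data in place, ready for the next one.
__global__ void kernel1(char* d_acq);
__global__ void kernel2(char* d_acq);
__global__ void kernel3(char* d_acq);
__global__ void kernel4(char* d_acq, float* d_results);

void process(char* d_acq, float* d_results, float* h_results,
             size_t results_bytes, cudaStream_t stream)
{
    dim3 grid(256), block(256);  // illustrative launch configuration

    // All four launches go onto the same stream, so stream ordering alone
    // guarantees they execute one after another in the required sequence.
    kernel1<<<grid, block, 0, stream>>>(d_acq);
    kernel2<<<grid, block, 0, stream>>>(d_acq);
    kernel3<<<grid, block, 0, stream>>>(d_acq);
    kernel4<<<grid, block, 0, stream>>>(d_acq, d_results);

    // Small D2H copy of the results (tens of KB at most). Note: the caller
    // must synchronise the stream (or wait on an event) before reading
    // h_results on the host.
    cudaMemcpyAsync(h_results, d_results, results_bytes,
                    cudaMemcpyDeviceToHost, stream);
}
```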
I thought streams would make perfect sense here, i.e. run the kernels on one stream while the read loop goes around waiting for, and transferring, the next 200 blocks of data on another stream. For the purpose of this example, let’s say each of the four kernels takes 10 ms, so if each of the 200 data blocks arrives at 1 ms intervals then I’d expect the 40 ms of kernel work to overlap the first ~20% of those data transfers.
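What I had in mind is something like this two-stream, double-buffered arrangement, reusing `read_loop()` and `process()` from the sketches above (`d_buf`, `numBatches`, `RESULTS_BYTES` etc. are placeholders), with an event handing each completed batch from the copy stream to the compute stream:

```cpp
constexpr size_t RESULTS_BYTES = 32 * 1024;  // "a few tens of KB"

void run_acquisition(const char* host_buf, char* d_buf[2],
                     float* d_results, float* h_results, int numBatches)
{
    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    cudaEvent_t batchReady, bufFree[2];
    cudaEventCreateWithFlags(&batchReady, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&bufFree[0], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&bufFree[1], cudaEventDisableTiming);

    for (int batch = 0; batch < numBatches; ++batch) {
        char* d_acq = d_buf[batch % 2];  // ping-pong device buffers

        // Don't copy into a buffer the kernels may still be reading.
        cudaStreamWaitEvent(copyStream, bufFree[batch % 2], 0);
        read_loop(host_buf, d_acq, copyStream);  // the 200 H2D transfers
        cudaEventRecord(batchReady, copyStream);

        // The kernels must wait for the whole batch to land, but because
        // they run on computeStream the next batch's copies can overlap.
        cudaStreamWaitEvent(computeStream, batchReady, 0);
        process(d_acq, d_results, h_results, RESULTS_BYTES, computeStream);
        cudaEventRecord(bufFree[batch % 2], computeStream);
    }

    cudaStreamSynchronize(computeStream);  // drain the final batch
}
```

The double buffering is there because the kernels for batch N are still reading one device buffer while the copies for batch N+1 fill the other.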
Having made the necessary code changes, I’m not seeing any overlapping in Nsight. I can confirm that all host buffers are pinned. After further reading up on streams, it seems that you don’t simply enqueue the operations sequentially, as we’re effectively doing, and I’ve seen numerous mentions of “breadth first” issue order (e.g. in [this webinar](https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf), from slide 15 onwards), although I’m still struggling to understand some of the concepts and causes of blocking.
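If I’ve understood the slides correctly, the difference is something like this (`kernelA`, the buffers and sizes are placeholders):

```cpp
__global__ void kernelA(char* d_in, char* d_out);  // placeholder kernel

void issue_order_demo(char* d_in[2], char* h_in[2],
                      char* d_out[2], char* h_out[2],
                      size_t bytes, cudaStream_t stream[2])
{
    dim3 grid(256), block(256);

    // Depth-first: issue everything for stream[0], then everything for
    // stream[1]. On hardware with a single copy queue this can serialise.
    for (int s = 0; s < 2; ++s) {
        cudaMemcpyAsync(d_in[s], h_in[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        kernelA<<<grid, block, 0, stream[s]>>>(d_in[s], d_out[s]);
        cudaMemcpyAsync(h_out[s], d_out[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    // Breadth-first: issue one stage at a time across all streams, avoiding
    // the false dependencies described from slide 15 onwards.
    for (int s = 0; s < 2; ++s)
        cudaMemcpyAsync(d_in[s], h_in[s], bytes, cudaMemcpyHostToDevice, stream[s]);
    for (int s = 0; s < 2; ++s)
        kernelA<<<grid, block, 0, stream[s]>>>(d_in[s], d_out[s]);
    for (int s = 0; s < 2; ++s)
        cudaMemcpyAsync(h_out[s], d_out[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
}
```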
I’m now thinking I have been naive in my understanding of how streaming works. What I’m still not entirely clear on is whether those two streams are totally independent, or whether each action occupies a “slot”, whereby in one slot the GPU executes an H2D and a kernel, then in the next slot the next H2D and kernel, and so on, with the time taken in each slot equal to the longer of the two operations. If that’s true then presumably the four kernels would overlap with the first four H2D transfers (each slot effectively taking 10 ms), before the remaining 196 data transfers complete as normal?
I’m starting to wonder whether streams are even suitable for this application, given that the read loop has to wait for a signal from the PCI card before it can issue each H2D transfer?