Why the periods of inactivity in the timeline?

My app processes data from a 3rd-party data acquisition PCI card connected to a scientific instrument. The card continually captures records/spectra, buffers (say) 40 at a time (each record is ~100 KB), then writes these to a pinned “transfer” buffer. (I don’t think it’s relevant here, but our C++ code calls a method in the card’s API that blocks until the transfer buffer has been filled.)
Our code then copies this transfer buffer to the GPU, executes a number of kernels to process these records, then finally copies a small “results” buffer back to the host. This process repeats until all records have been acquired and processed. A run typically lasts tens of seconds to several minutes.
The results copied to the host are stored in a queue, and a background thread in a C# application periodically writes these to file. I’d estimate it writes around 20 MB a second.

When viewed in Nsight Systems, what should be a continuous process of acquiring and processing data looks like this, with “gaps” at consistent 0.5s intervals:


(This is very zoomed out, by the way - the H2D transfers are typically ~300us each and kernel execution ~170us.)

I’m not sure what is causing those gaps. It’s not the acquisition card, as this acquires records using a hardware trigger on the instrument.
The host application uses a timer to save results from the queue to file, but this fires every 250ms, while the gaps in the timeline are 0.5s +/- 1ms apart – too regular for a software timer, I believe.
The host app periodically starts a new “results” file, so I’m wondering if the gaps correspond to when the previous file is flushed/closed. I’m still doing the math on this as it closes a file after N results have been written. Would something like this affect the GPU side of things, given that it’s happening on a separate thread?

I’ve looked through the remaining timeline rows but can’t see any activities that coincide with these gaps. Is there anything I should be looking for? I’m still getting to grips with Nsight Systems, so I don’t know if there is an easy way to look for things like this without having to expand hundreds of process/thread rows.

Lastly, on a related note, the results files are written to an M.2 drive. As this is connected directly to the PCI bus, could this have an adverse effect on cudaMemcpy performance, and if so should we be using a SATA SSD instead?

PCIe is not a bus. It is based on point-to-point connections, most commonly between a “root complex” (typically part of the host system’s CPU) and “endpoints” (such as the GPU). In a properly designed host system, there will be a sufficient number of PCIe lanes, of which each GPU ideally takes 16, with some lanes left over for I/O devices such as bulk storage. Given the massive performance disparities between NVMe and SATA SSDs, use of an NVMe SSD is definitely indicated for any high-performance system.

As for the gaps in the timeline: (1) Are there application-level performance issues (e.g. throughput, jitter) in this use case that make it mandatory to root cause these gaps? (2) What kind of instrumentation have you tried to find out where those gaps occur with respect to other system activity? I assume analysis has shown that there is no user-scheduled activity whatsoever that takes place with a half-second cadence?

The idea with instrumentation is to find an activity that (1) shows up in logged output (of which the profiler timeline would be one example), (2) has no (or at least no significant) side effects, and (3) can be inserted at various levels of a software stack. A classic example from embedded systems would be writes to a specific hardware port. This would allow you to narrow down what happens just before or just after these gaps occur, which in turn would likely put you on the right investigative path.

Am I correct to assume that the acquisition card uses a double-buffering approach, or a more elaborate system such as a BD ring, to effectively decouple the sample acquisition process from communication with the rest of the system?

[Later:] I forgot to ask the obvious. How do we know these gaps don’t represent genuine idle time, that is, there is no activity because there is nothing to do? As a corollary: do gap duration and cadence change at all when you dial the data transfer rate from the acquisition card up or down?

To answer your last question: the acquisition card does employ some form of internal buffer, so it can hold records for a limited time if the “consuming” app isn’t able to service the incoming data. Those “gaps” therefore don’t appear to cause any problems such as missing records/data (though I suspect they would if the run were long enough).

To expand on my explanation of how the acquisition process works, several aspects of the run are user-configurable. One such setting affects how many transfer buffers are copied to the GPU before processing, so it’s not always a 1:1 mapping as described above. In some use cases we perform multiple transfers, appending these to a larger GPU buffer, before running the kernels. One real-world example performs 500 transfers, taking a total of ~400ms, before running the kernels to process those 20k records (total time ~100ms). Here, there are no gaps in the timeline, suggesting it’s not an “external” (PC/software) cause, which would presumably show up every time.
In the use case where we see the gaps, we’re obviously generating results more frequently (one every ~470us), versus the latter example where it’s only once every 500ms or so, so my gut feeling is that the problem is related to how the host saves these to the queue or writes them to disk.
It doesn’t feel like the acquisition card either - in both examples it is still acquiring records at the same frequency and sending them to us in the same 40-record transfer buffers.

It sounds like I need to start profiling other aspects of the host application, rather than relying on Nsight Systems to track down the cause of this one, although I was hoping it might have provided some insight into what the host application or file system was doing during these gaps.

This is something I overlooked previously. I am not familiar with C#. Does it use garbage collection at all? Even if it did, it would seem strange for that to kick in so frequently and on such an exact schedule of half-second intervals.

I think most users in this forum are located in North America, where it is still night time; 3:28am where I am. So you might want to wait 24 hours for useful responses before diving into additional work. It is likely that there are forum participants with recent experience interfacing with ASICs (something I last did 25 years ago; I dealt with data acquisition hardware in the 1990s), and with much better knowledge of profiler features (there might be good ways to tease the desired information from Nsight, but off the top of my head I don’t know what they would be).