My application transfers a series of “records” into GPU memory (from an external data acquisition PCI card), which are then processed by a kernel. Each record typically contains ~45,000 float values, and we typically transfer thousands of records to the GPU before running the kernel (the record length and the number of records per transfer depend on user settings in our desktop app).
Additionally, the user will define around 200 “regions”, used to highlight areas of interest within the records. The “width” of the regions isn’t uniform and can vary between 20 and 150 data points. Regions can be positioned anywhere within the record (but cannot overlap), so they aren’t uniformly spaced.
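For reference, a minimal sketch of how the region definitions could be represented on the device (the struct and field names here are my own, purely for illustration, not our actual code):

```cpp
// Hypothetical per-region descriptor, in data points within a record.
struct Region {
    int start;  // index of the first point of the region
    int width;  // number of points in the region (~20-150)
};

// The ~200 descriptors would be copied to the GPU once per run, e.g.:
//   cudaMemcpy(d_regions, h_regions, numRegions * sizeof(Region), cudaMemcpyHostToDevice);
```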
The processing kernel is launched as <<<number of records in buffer, number of regions>>>, so each thread is responsible for processing one region in one record.
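In other words, the launch looks roughly like this (placeholder names, not our real code):

```cpp
// One block per record, one thread per region (~200 threads per block).
noiseCorrect<<<numRecords, numRegions>>>(d_in, d_out, d_regions, recordLen, pointsEitherSide);
```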
The kernel code starts by summing a number of points on either side of the region (the number of points to sum is user-configurable, but typically ~10-20 per side), then averages these to give a “noise” value. The code then iterates over each value in the region proper, subtracting the noise value and writing the corrected value out to a second buffer. Once complete, we’re essentially left with a copy of the original buffer but with “noise corrected” regions. The kernel doesn’t do anything with the values between the regions, so these remain “uninitialised” in the output buffer. This buffer is then passed to a second kernel for further processing.
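Stripped down, the kernel body does something along these lines (a simplified sketch using the Region struct above; names are placeholders, and it assumes each region sits far enough from the record edges for its baseline window):

```cpp
__global__ void noiseCorrect(const float* __restrict__ in,
                             float* __restrict__ out,
                             const Region* __restrict__ regions,
                             int recordLen,
                             int pointsEitherSide)
{
    const int record = blockIdx.x;            // one block per record
    const Region r   = regions[threadIdx.x];  // one thread per region
    const float* rec = in  + (size_t)record * recordLen;
    float*       dst = out + (size_t)record * recordLen;

    // Average ~10-20 points immediately before and after the region -> "noise".
    float sum = 0.0f;
    for (int i = 0; i < pointsEitherSide; ++i) {
        sum += rec[r.start - 1 - i];          // points before the region
        sum += rec[r.start + r.width + i];    // points after the region
    }
    const float noise = sum / (2.0f * pointsEitherSide);

    // Subtract the noise from every point in the region and write it out.
    for (int i = 0; i < r.width; ++i)
        dst[r.start + i] = rec[r.start + i] - noise;
}
```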
The process then repeats, i.e. transfer another X thousand records, process, and so on. These runs can last many minutes.
Example kernel execution time: 21ms with a record length of ~47,000 float values, 210 regions, and 10,000 records transferred to the GPU before processing. The kernel time scales much worse than linearly, though: doubling the number of records to 20,000 increases the time to 400ms! This is after we were able to make substantial improvements using async pre-fetching in the “for” loop.
In my (very) limited understanding of CUDA, I assume the problem is due to uncoalesced global memory access, given the nature of what the kernel is doing. Some numbers from Nsight: “Uncoalesced Global Accesses” (estimated speedup 85%), “Long Scoreboard Stalls” (estimated speedup 50%), and “L1TEX Global Store Access Pattern” (estimated speedup 48%).
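To illustrate what I think Nsight is flagging (my own reasoning, so I may be off): with the thread-per-region mapping, at each loop iteration the 32 threads of a warp each touch one point in 32 different regions, i.e. addresses that are typically thousands of floats apart, so a single warp-level load or store splits into many memory transactions:

```cpp
// Iteration i of the correction loop, as seen by one warp (lanes 0..31):
//   lane 0  -> rec[regions[0].start  + i]
//   lane 1  -> rec[regions[1].start  + i]
//   ...
//   lane 31 -> rec[regions[31].start + i]
// The region start offsets are scattered across a ~45,000-point record, so
// these 32 accesses rarely fall within the same 128-byte segment -> uncoalesced.
```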
The question is: what can I do to improve performance, if anything? I get the feeling that the nature of our processing precludes us from fully utilising memory coalescing, especially given how many aspects of this are variable (record length, number of regions, region widths and locations). Any pointers/suggestions would be much appreciated. I’m still getting up to speed on CUDA, so there might be ideas and concepts I’m not aware of.
We’re currently running this on an NVIDIA Quadro RTX 4000, by the way.