Looking for performance tweak suggestions

My application transfers a series of “records” into GPU memory (from an external data acquisition PCI card), which are then processed by a kernel. Each record typically contains ~45,000 float values, and we’ll typically transfer thousands of records to the GPU before running the kernel (record length and number of records to transfer are dependent on user settings in our desktop app).
Additionally, the user will define around 200 “regions”, used to highlight areas of interest within the records. The “width” of the regions isn’t uniform and can vary between 20 and 150 data points. Regions can be positioned anywhere within the record (but cannot overlap), so they aren’t uniformly spaced.

The processing kernel launch settings are <<<number of records in buffer, number of regions>>>, so each thread is responsible for processing one region in one record.
The kernel code starts by summing a number of points either side of the region (the number of points to sum is user-configurable, but typically ~10-20 on each side), then averages these to give a “noise” value. The code then iterates over each value in the region proper, subtracting the noise value and writing this corrected value out to a second buffer. Once complete, we’re essentially left with a copy of the original buffer but with “noise corrected” regions. The kernel doesn’t do anything with the values between the regions so these remain “uninitialised” in the output buffer. This buffer is then passed to a second kernel for further processing.
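
To make that concrete, the baseline calculation is essentially this (a simplified sketch; the identifier names are made up for this post):

// Simplified sketch of the "noise" (baseline) calculation described above.
// 'data' points to the start of the record; 'regionStart' and 'regionLen'
// locate the region; 'numSidePoints' is the user-configurable count (~10-20).
__device__ float compute_baseline(const float* data, int regionStart,
                                  int regionLen, int numSidePoints)
{
    float sum = 0.0f;
    for (int i = 0; i < numSidePoints; ++i)
    {
        sum += data[regionStart - 1 - i];          // points just before the region
        sum += data[regionStart + regionLen + i];  // points just after the region
    }
    return sum / (2.0f * numSidePoints);
}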

The process then repeats, i.e. transfer another X thousand records, process, and so on. These runs can last many minutes.

Example kernel execution time: 21ms with a record length of ~47,000 float values, 210 regions, and transferring 10,000 records to the GPU before processing. The kernel time scales far worse than linearly though: doubling the number of records to 20,000 increases the time to 400ms (roughly 19x)! This is after we were able to make substantial improvements using async pre-fetching in the “for” loop.

In my (very) limited understanding of CUDA, I assume the problem is due to uncoalesced global memory access, given the nature of what the kernel is doing. Some numbers from Nsight: “Uncoalesced Global Accesses” (estimated speedup 85%), “Long Scoreboard Stalls” (estimated speedup 50%), and “L1TEX Global Store Access Pattern” (estimated speedup 48%).

The question is: what can I do to improve performance, if anything? I get the feeling that the nature of our processing precludes us from fully utilising memory coalescing, especially given the variability of so many aspects (record length, number of regions, region widths and locations). Any pointers/suggestions would be much appreciated. I’m still getting up to speed on CUDA so there might be ideas and concepts that I’m not aware of.

We’re currently running this on an NVIDIA Quadro RTX 4000, by the way.

It would be good if you could share the kernel code rather than describing it. Or even better, a minimal reproducer which can be compiled and run to measure the performance, for example with your described inputs.

Example kernel execution time: 21ms with a record length of ~47,000 float values, 210 regions, and transferring 10,000 records to the GPU.

For starters, I would suggest using a full warp per region

Here’s the pertinent part of the kernel:

#include <cuda_pipeline.h> // for the __pipeline_* intrinsics used below

__shared__ float v[513]; // Larger than the max number of regions, sized as a power of 2 plus 1. The +1 padding is the usual trick to avoid shared memory bank conflicts (though each thread only ever touches its own element here).

// Function to perform the noise correction (baseline removal)
__device__ void noise_correct(
    const float* regionData,
    float* resultData,
    const int bufferLength,
    const float baseline)
{
    // First and last value in the "processed" region will always be 0 for reasons I won't go into
    resultData[0] = 0.0f;
    resultData[bufferLength - 1] = 0.0f;

    // Async pre-fetch request for the first value in the region
    __pipeline_memcpy_async(&v[threadIdx.x], &regionData[1], sizeof(float));
    __pipeline_commit();

    int last = bufferLength - 1;
    for (int i = 1; i < last; ++i)
    {
        // Wait for the previous async pre-fetch to complete before taking the value from the shared array.
        __pipeline_wait_prior(0);
        float value = v[threadIdx.x] - baseline;
        
        // Async pre-fetch request for the next value in the region to be processed.
        __pipeline_memcpy_async(&v[threadIdx.x], &regionData[i + 1], sizeof(float));
        __pipeline_commit();

        resultData[i] = value;
    }
}

// Kernel entry point
__global__ void NoiseCorrectionKernelCore(
    float* first_pass_accum_buffer,
    float* noise_corrected_buffer,
    size_t numRegions,
    int record_len)
{
    int record_number = blockIdx.x;
    int region_number = threadIdx.x;
    
    // Notes:-
    // 'startIndex' is the location of the region within the buffer. 'bufferLength' is the number of values 
    // in the region. Calculation of these removed for brevity, but basically comes from a lookup table using the
    // above integers.
    // 'baseline' is the calculated baseline as detailed in my OP. I've removed this part of the kernel for brevity, 
    // as the baseline removal above is the pertinent area.

    // Pointer to the start of this thread's region, in both the "source" buffer and the "results" buffer.
    const float* regionData = &(first_pass_accum_buffer[startIndex]);
    float* resultData = &(noise_corrected_buffer[startIndex]);

    noise_correct(regionData, resultData, bufferLength, baseline);
}

If the records were smaller, then I guess I could have copied the record into a shared array and had the “for” loop read values from that. I don’t suppose there’s a way around this given the large record sizes?

I’d thought about dividing the list of regions into (say) three and running the kernel three times. That way I could load a third of the record into shared memory. However, this would complicate the code, plus it’s not clear whether it would improve memory access performance, and if so, whether the gain would be cancelled out by the reduced parallelism.
Although I gave the earlier example of processing 10,000 records, it’s worth mentioning that during some runs this figure could be in single digits. Record length can occasionally vary too - as low as ~6000 floats up to a max of ~65,000 floats.

Is there any benefit in changing the read loop to use vector types here, to read 2, 4, or more values “simultaneously”? From what I’ve read, vectorised loads don’t seem to result in very big gains.
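
Something like this is what I had in mind (purely hypothetical; it assumes regionData happens to be 16-byte aligned and the length is a multiple of 4, neither of which is guaranteed with our arbitrary region offsets, and it ignores the zeroed end points for brevity):

// Hypothetical vectorised version of the inner loop: four floats per load.
const float4* vec = reinterpret_cast<const float4*>(regionData);
for (int i = 0; i < bufferLength / 4; ++i)
{
    float4 v4 = vec[i];
    resultData[i * 4 + 0] = v4.x - baseline;
    resultData[i * 4 + 1] = v4.y - baseline;
    resultData[i * 4 + 2] = v4.z - baseline;
    resultData[i * 4 + 3] = v4.w - baseline;
}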

Just trying to throw some ideas out there really.

I would try to process each record by a warp. Your memory access times are much more critical than your floating point computation time. Warps coalesce memory accesses better and are more flexible when regions have different sizes (if a single thread quits, the warp diverges; if a whole warp quits, it simply no longer participates in scheduling). Synchronisation between the threads of a warp is also simpler, and they can share data with shuffles or via shared memory.

Also, warps could work on several regions in a for loop, so they do not quit so early.
Then, for even better optimization, e.g. threads 0…7 could work on the last 8 data points of the previous region while, at the same time, threads 13…15 work on the first 3 data points of the next region. Align the memory accesses either by 32 bits or by 128 bits and keep the other threads idle.
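
As a rough sketch of the multi-region loop (numWarps, regionStart and regionLength are just illustrative names; the latter two stand in for your lookup table):

// Sketch: each warp loops over several regions instead of quitting early.
int lane = threadIdx.x & 31;                                // lane within the warp
int warpId = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;  // global warp index
for (int r = warpId; r < numRegions; r += numWarps)         // numWarps = total warps in the grid
{
    int start = regionStart[r];   // from your lookup table
    int len   = regionLength[r];
    for (int i = lane; i < len; i += 32)
        processElement(memory[start + i]);
}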

I wasn’t quite sure what you meant by processing each record by a warp (and suspect you possibly meant each region?). I’m vaguely aware of how warps improve memory coalescing though, so it gave me some inspiration to refactor the kernels. I now use these launch parameters:

dim3 group(numRecords, numRegions); // number of records in buffer × number of regions
NoiseCorrectionKernelCoreHp<<<group, 32>>>

Now, rather than each thread iterating over an entire region, each thread “baseline-subtracts” one value in the region assigned to that block, then strides by 32 until all values in the region have been processed. This has given me an average 6-10x speed-up in kernel execution time. Thanks for pointing me in the right direction on this; it gives me a basis to experiment further.
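
In case it’s useful to anyone else, the reworked kernel now looks roughly like this (the lookup-table details are again omitted, as in my earlier post):

__global__ void NoiseCorrectionKernelCoreHp(
    const float* first_pass_accum_buffer,
    float* noise_corrected_buffer)
{
    int record_number = blockIdx.x;
    int region_number = blockIdx.y;
    int lane = threadIdx.x; // 0..31

    // 'startIndex', 'bufferLength' and 'baseline' are derived from
    // record_number/region_number as before (omitted for brevity).

    const float* regionData = &first_pass_accum_buffer[startIndex];
    float* resultData = &noise_corrected_buffer[startIndex];

    // Each of the 32 threads handles every 32nd value in the region,
    // so consecutive lanes read consecutive addresses (coalesced).
    for (int i = lane; i < bufferLength; i += 32)
    {
        // First and last values are forced to zero, as in the original.
        resultData[i] = (i == 0 || i == bufferLength - 1)
                            ? 0.0f
                            : regionData[i] - baseline;
    }
}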


Yes, that is what I meant, sorry for the typo.
6-10x speed up is nice!

BTW I just saw @striker159 suggesting the same:

You can then use the shuffle instructions or shared memory to combine the results of the 32 threads.
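
In code, combining the 32 partial sums with shuffles could look like this (a sketch; warp_sum is just an illustrative name):

// Each lane passes in its partial sum of the side points; after the
// butterfly reduction every lane holds the warp-wide total.
__device__ float warp_sum(float partial)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        partial += __shfl_xor_sync(0xFFFFFFFF, partial, offset);
    return partial; // identical in all 32 lanes
}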

It is more optimal if you enforce alignment to 256 bits (32 bytes = 8 × 4-byte values) or 1024 bits (128 bytes = 32 × 4-byte values).
Let us do 1024 bits, assuming you read values of 4 bytes.

Instead of

int startindex = 207; // given as parameter
int lane = threadIdx.x;
for (int i = startindex + lane; i < startindex + regionlength; i += 32)
    processElement(memory[i]);

you do

int startindex = 207; // given as parameter
int alignedstart = startindex & ~0x1F; // next lower aligned index; use ~0x07 for 256 bits
int lane = threadIdx.x;
for (int i = alignedstart + lane; i < startindex + regionlength; i += 32)
    // The guard can only be false in the first iteration; you can peel the first
    // iteration out of the loop to avoid testing the condition every time.
    if (i >= startindex)
        processElement(memory[i]);

This optimizes coalescing for the warps’ memory accesses. (In the first variant, a single warp-wide read could straddle an alignment boundary, with some threads reading elements from before it and some from after, splitting the access into extra memory transactions.)

As a next step, you could optimize further by combining the memory accesses of the first and last loop iterations (each of which is done by only some of the threads), or combine them across different regions. But if your kernel is memory-bound, this probably would not speed up your code any further.