Looking for kernel performance suggestions

I’m after some more guidance if possible please. I’ve posted about other aspects of this app recently, but this issue relates to a different kernel (so reading my other posts isn’t needed!).
The application transfers a series of “records” into GPU memory; these arrive from an external data acquisition PCI card. Once there are enough records in the buffer, we run a series of kernels to process that data.
The number of values in each record (the “record length”) is user configurable but typically around 45,000, and the values are of short type. The number of records transferred into the buffer before being processed is also user configurable. The process then repeats, i.e. transfer X records, process, transfer, process, etc., often for many minutes.

The problem is this: if we transfer 20,000 records to the GPU before processing, the kernel in question (which I’ll come to below) takes 14ms to execute. If we double the number of records to 40,000 but keep all other settings the same, then the kernel execution time jumps to over 400ms.

The kernel in question is responsible for summing the records (“accumulating” is the terminology we use). The number of records to accumulate is user configurable but almost always in single digits. E.g. say we have 20,000 records in the buffer and ‘num accumulations’ = 2. The kernel will therefore sum the first two records in the buffer together, the next two records together, and so on. The summed values are written to an “output” buffer, with the end result being 10,000 new “accumulated” records.
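
To make the grouping concrete, here is a plain C++ sketch of what this accumulation step produces (the function and variable names here are just illustrative shorthand, not our actual code):

// Plain CPU reference of the accumulation (illustrative names only):
// in_buff  holds num_records consecutive records of record_len shorts,
// out_buff holds num_records / num_accums records of record_len floats.
void AccumulateReference(const short* in_buff, float* out_buff,
                         size_t record_len, size_t num_records, size_t num_accums)
{
    size_t num_groups = num_records / num_accums;
    for (size_t g = 0; g < num_groups; g++)             // one output record per group
    {
        for (size_t i = 0; i < record_len; i++)         // each value within the record
        {
            int sum = 0;
            for (size_t na = 0; na < num_accums; na++)  // sum across the N records in the group
                sum += in_buff[(g * num_accums + na) * record_len + i];
            out_buff[g * record_len + i] = (float)sum;
        }
    }
}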
I’m assuming the poor performance can be attributed to the way a thread has to make large strides across the buffer as it sums its designated value from each record, negating any benefits of memory coalescing.

Here’s the kernel code. The launch parameters are:
<<<number of records in buffer / num accumulations, 1024>>>

__global__ void FirstPassAccumulation(
    short* in_buff,
    float* out_buff,
    size_t record_len,
    size_t num_accums)
{
    // Which "group" of N neighbouring records to sum (where N=num accums)
    int groupIdx = blockIdx.x;
    // Value to sum (ultimately covering the range 0 to reclen-1 using striding in the "while" loop below)
    int valueIdx = threadIdx.x;

    int groupOffset = groupIdx * num_accums;
    int accumulatedRecordOffset = groupIdx * record_len;  
  
    while (valueIdx < record_len)
    {
        int sum = 0;
        for (size_t na = 0; na < num_accums; na++)
        {
            sum += in_buff[((groupOffset + na) * record_len) + valueIdx];
        }
        out_buff[accumulatedRecordOffset + valueIdx] = sum;

        valueIdx += blockDim.x;
    }
}

(The “output” buffer is a float type, as later kernels will perform further processing on this data).
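
For concreteness, the host-side launch currently looks roughly like this (d_in_buff / d_out_buff are placeholder names for the device buffers, not our actual variables):

    // Host-side launch matching the parameters above:
    FirstPassAccumulation<<<num_records_in_buffer / num_accumulations, 1024>>>(
        d_in_buff, d_out_buff, record_len, num_accumulations);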

The kernel does work, by the way - the issue here is the disproportionate increase in execution time as the number of records grows. I did think about splitting up the data, e.g. running this kernel after every (say) 1,000 records rather than waiting until we have all 20,000. At this stage I’m really after a sanity check of the code, in case there are opportunities to optimise it, before I start on more drastic refactoring.

What is the possible range of values for num_accumulations?

I don’t understand why the two loops are ordered in this fashion. I would have expected record_len threads collectively doing fully coalesced reads of the records, with thread i responsible for summing item i of consecutive records in groups of num_accumulations. With 45,000 items per record the GPU would be kept fully busy.

Perhaps some L2 cache issue? Can you use Nsight Compute to check? (Although you do not load values repeatedly, so there is little re-use for the cache to exploit.)

For coalescing, make sure that the records in in_buff are aligned, e.g. that record_len is divisible by 32 (or 64 for short). Since the data is short, also consider using short2 instead.

num_accumulations is nearly always in the single digits, e.g. 2 in this example. It can very rarely be set as high as 15 or so.

Regarding the grid/block design, my understanding is that the block size is limited to 1024 in CUDA (in a previous OpenCL implementation we did have the block size, or its equivalent, set to record_len, which simplified the kernel somewhat).
I set the grid size to (num records / num accumulations), which in the above example would be 10,000. It makes the code a little more understandable, in that one CUDA block is responsible for summing the values across a group of N records (where N=num_accumulations).
Hope that clarifies things, but apologies if my understanding of kernel design is limited.

Obviously one would use more than one thread block to process records of 45,000 items each. For example, one could run 352 blocks of 128 threads each. Recall the standard CUDA pattern for processing a 1D array:

    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] =  .... src[i];
    }

At least it is my understanding at this point that each record is essentially a 1D array of equal-sized items. If that is not the case I would suggest clarifying this aspect.

It seems to me that the code as-is has a poor memory access pattern, and it would have a close to optimal access pattern when processed in the way I suggested. It is perfectly fine for individual threads to access with a large stride when they collectively make fully coalesced accesses to memory (the “base + tid” pattern).

I see how your example relates to the record length, and you are right that each record is essentially a 1D array. I’m just not sure how to adapt/scale up that “standard pattern” in my scenario, where there are thousands of records in the buffer that need to be divided into groups of N records for accumulating.

Assuming “num_accumulations” is 2, a given thread needs to sum the i-th value from 2 neighbouring records within the buffer (where i = 0 to reclen-1). In my earlier example where there are 20,000 records in the buffer, these would be logically split up into 10,000 groups, each containing 2 records that need to be summed/accumulated.

The kernel’s grid size is currently set such that it maintains that concept of a group of records to sum, but it does mean it isn’t true “grid striding” (I’m only striding by blockDim.x), so perhaps this is my problem. What if I changed the launch parameters to something like this:

dim3 group(num_records_in_buffer / num_accumulations, 352);
kernel<<<group, 128>>>...

Here, I’ve taken your example of 352 blocks and 128 threads (for coverage of a reclen of 45,000). The 2D grid size means I maintain that earlier concept of “groups” of records to sum, while also being able to properly “grid stride” using gridDim.y * blockDim.x.
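
To illustrate the indexing I have in mind, something like this (untested sketch, reusing the names from my original kernel):

// Untested sketch of the 2D-grid variant described above:
//   gridDim.x  = number of groups (num records / num accumulations)
//   gridDim.y  = 352 blocks covering the record length
//   blockDim.x = 128 threads
__global__ void FirstPassAccumulation2D(
    short* in_buff,
    float* out_buff,
    size_t record_len,
    size_t num_accums)
{
    size_t groupIdx = blockIdx.x;                           // which group of N records to sum
    size_t groupOffset = groupIdx * num_accums;             // first record in the group
    size_t accumulatedRecordOffset = groupIdx * record_len;

    // Stride across the record using the y dimension of the grid
    size_t stride = (size_t)gridDim.y * blockDim.x;
    for (size_t valueIdx = (size_t)blockIdx.y * blockDim.x + threadIdx.x;
         valueIdx < record_len;
         valueIdx += stride)
    {
        int sum = 0;
        for (size_t na = 0; na < num_accums; na++)
            sum += in_buff[(groupOffset + na) * record_len + valueIdx];
        out_buff[accumulatedRecordOffset + valueIdx] = sum;
    }
}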

Here is pseudo code for the kernel that I am envisioning.

int tid = blockDim.x * blockIdx.x + threadIdx.x;
if (tid < recordLen) {
    int resIdx = 0;
    for (int j = 0; j < numRecords; j += numAccums) {
        int sum = 0;
        for (int jj = j; jj < (j + numAccums); jj++) {
            sum += in_record[jj].item[tid];      /// <<<<<< coalesced reads
        }
        out_record[resIdx].item[tid] = sum;      /// <<<<<< coalesced writes
        resIdx++;
    }
}

Call it:

    dim3 dimBlock(128);
    int threadBlocks = (recordLen + (dimBlock.x - 1)) / dimBlock.x;
    dim3 dimGrid(threadBlocks);
    kernel<<<dimGrid,dimBlock>>>(args ...)

The input and output streams are (except for end cases) fully coalesced, as consecutive threads read from / write to consecutive words. Code written in browser, way beyond my normal bedtime, so buyer beware.
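
In terms of the flat in_buff / out_buff layout of your original kernel, that pseudo code would translate to roughly the following (same caveats: untested, and the parameter names are mine):

// Untested translation of the pseudo code onto the flat in_buff / out_buff layout:
__global__ void AccumulateCoalesced(
    const short* in_buff,
    float* out_buff,
    size_t record_len,
    size_t num_records,
    size_t num_accums)
{
    size_t tid = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < record_len)
    {
        size_t resIdx = 0;
        for (size_t j = 0; j < num_records; j += num_accums)
        {
            int sum = 0;
            for (size_t jj = j; jj < j + num_accums; jj++)
                sum += in_buff[jj * record_len + tid];    // coalesced reads
            out_buff[resIdx * record_len + tid] = sum;    // coalesced writes
            resIdx++;
        }
    }
}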

Thanks for the suggestions/pointers - I think I see how you’ve turned the “nesting” around to improve the coalescing. I’ll digest it in full though and give it a try, then post the outcome.

If the read data is of short type, use short2 and let each thread process two data items at the same time. Nvidia GPUs are optimized for 32-bit memory access and processing per thread.
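
For example, a short2 variant of the per-element kernel sketched above might look like this, assuming record_len is even and the buffers can be reinterpreted as short2 / float2 (untested):

// Untested short2 variant of the per-element kernel above:
__global__ void AccumulateCoalescedShort2(
    const short2* in_buff,      // reinterpret_cast of the original short buffer
    float2* out_buff,           // reinterpret_cast of the original float buffer
    size_t record_len2,         // record_len / 2
    size_t num_records,
    size_t num_accums)
{
    size_t tid = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < record_len2)
    {
        size_t resIdx = 0;
        for (size_t j = 0; j < num_records; j += num_accums)
        {
            int2 sum = make_int2(0, 0);
            for (size_t jj = j; jj < j + num_accums; jj++)
            {
                short2 v = in_buff[jj * record_len2 + tid];   // one 32-bit load per thread
                sum.x += v.x;
                sum.y += v.y;
            }
            out_buff[resIdx * record_len2 + tid] = make_float2((float)sum.x, (float)sum.y);
            resIdx++;
        }
    }
}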


Mixed results after making this change, and not as good as I had perhaps hoped for.

In all tests below the reclen is 47296 and num accumulations is 2. (Other configurable aspects of the app can affect how many records end up in the buffer - sometimes dramatically, as seen in the first test).

Num records in buffer=40: kernel execution time reduced from 44us to 35us.
Num records in buffer=20,000: kernel execution time increased from 15ms to 26ms.
Num records in buffer=40,000: kernel time decreased from 414ms to 366ms.

I’d say it is high time to break out the profiler. Given that there is hardly any computation and no data re-use, performance should only be limited by memory throughput, so focus on stats related to utilization of the memory subsystem.

If the values of user-configurable parameters can vary widely, you may wind up with multiple kernels covering the various corners of the configuration space, and may need to design a heuristic or calibration run to pick the fastest kernel. If you look at something like *GEMM in CUBLAS, there are actually dozens of different kernels working under the hood.
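
Purely as an illustration, such a dispatcher could be as simple as the following; the condition and the threshold are placeholders that would have to come out of benchmarking your kernels across the relevant configurations (the kernel names refer to the sketches earlier in the thread):

// Illustrative dispatcher only; threshold and condition are made up and
// would need to be calibrated on the target GPU and configuration space.
void LaunchAccumulation(short* d_in, float* d_out,
                        size_t record_len, size_t num_records,
                        size_t num_accums, cudaStream_t stream)
{
    const size_t kRecordThreshold = 4096;   // placeholder value
    if (num_records < kRecordThreshold)
    {
        // variant A: one thread block per group of records
        unsigned groups = (unsigned)(num_records / num_accums);
        FirstPassAccumulation<<<groups, 1024, 0, stream>>>(
            d_in, d_out, record_len, num_accums);
    }
    else
    {
        // variant B: one thread per record element, coalesced across records
        unsigned block = 128;
        unsigned grid = (unsigned)((record_len + block - 1) / block);
        AccumulateCoalesced<<<grid, block, 0, stream>>>(
            d_in, d_out, record_len, num_records, num_accums);
    }
}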

Sanity check: Performance work must use release builds with full optimizations. That is what you are using, correct?

Thanks for the reply. Are you referring to Nsight “Compute” rather than “Systems” for this kind of thing? I haven’t done much with the former, so is there anything I should look out for?

I believe there is still some scope to optimise aspects of the code, e.g. in that first example we’ll be able to append many such 40 record “data sets” to the GPU buffer before processing them, to better utilise the threads.

Regarding your last point about optimisations, I’ll have to check with our software engineer, who isn’t around today, but I believe he’s been running the tests on a release build of the C++ host application. Is this what you mean, or are you referring to CUDA’s own optimisation settings? I assume this would be the nvcc optimization flags, which apparently default to -O3 in Visual Studio? It’ll probably mean more to our engineer, so I’ll ask him to look into it, but are there any particular flags we can look into?

The app is running on Windows and using a Quadro RTX 4000, if that is of any relevance.

Yes, I was thinking of Nsight Compute. I am not a profiler expert; there is a separate subforum for the profiler.

While I have a workstation with Windows 10 and a Quadro RTX 4000 here, I never use the Visual Studio IDE and don’t know what the knobs look like for CUDA builds. I build strictly via the command line where I am dealing directly with nvcc and msvc command line switches.

The device-side compiler defaults to building with full optimization, but it happens occasionally that we go through N iterations of a performance issue in a forum thread, the perf numbers don’t seem to quite add up, and we discover belatedly that the OP was running a debug build, which implies that all optimizations are disabled for device-side compilation. So best to check on that sooner rather than later.
