Looking for kernel performance suggestions

I’m after some more guidance if possible please. I’ve posted about other aspects of this app recently, but this issue relates to a different kernel (so reading my other posts isn’t needed!).
The application transfers a series of “records” into GPU memory; these arrive from an external data acquisition PCI card. Once there are enough records in the buffer, we run a series of kernels to process that data.
The number of values in each record (the “record length”) is user configurable but typically around 45,000, and the values are of short type. The number of records transferred into the buffer before being processed is also user configurable. The process then repeats, i.e. transfer X records, process, transfer, process, etc., often for many minutes.

The problem is this: if we transfer 20,000 records to the GPU before processing, the kernel in question (which I’ll come to below) takes 14ms to execute. If we double the number of records to 40,000 but keep all other settings the same, then the kernel execution time jumps to over 400ms.

The kernel in question is responsible for summing the records (“accumulating” is the terminology we use). The number of records to accumulate is user configurable but almost always in single digits. E.g. say we have 20,000 records in the buffer and ‘num accumulations’ = 2. The kernel will therefore sum the first two records in the buffer together, the next two records together, and so on. The summed values are written to an “output” buffer, with the end result being 10,000 new “accumulated” records.
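
To make the grouping concrete, here is a plain C++ sketch of what this accumulation step produces (the function and variable names here are just illustrative shorthand, not our actual code):

// Plain CPU reference of the accumulation (illustrative names only):
// in_buff  holds num_records consecutive records of record_len shorts,
// out_buff holds num_records / num_accums records of record_len floats.
void AccumulateReference(const short* in_buff, float* out_buff,
                         size_t record_len, size_t num_records, size_t num_accums)
{
    size_t num_groups = num_records / num_accums;
    for (size_t g = 0; g < num_groups; g++)             // one output record per group
    {
        for (size_t i = 0; i < record_len; i++)         // each value within the record
        {
            int sum = 0;
            for (size_t na = 0; na < num_accums; na++)  // sum across the N records in the group
                sum += in_buff[(g * num_accums + na) * record_len + i];
            out_buff[g * record_len + i] = (float)sum;
        }
    }
}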
I’m assuming the poor performance can be attributed to the way a thread has to make large strides across the buffer as it sums its designated value from each record, negating any benefits of memory coalescing.

Here’s the kernel code. The launch parameters are:
<<<number of records in buffer / num accumulations, 1024>>>

__global__ void FirstPassAccumulation(
    short* in_buff,
    float* out_buff,
    size_t record_len,
    size_t num_accums)
{
    // Which "group" of N neighbouring records to sum (where N=num accums)
    int groupIdx = blockIdx.x;
    // Value to sum (ultimately covering the range 0 to reclen-1 using striding in the "while" loop below)
    int valueIdx = threadIdx.x;

    int groupOffset = groupIdx * num_accums;
    int accumulatedRecordOffset = groupIdx * record_len;  
  
    while (valueIdx < record_len)
    {
        int sum = 0;
        for (size_t na = 0; na < num_accums; na++)
        {
            sum += in_buff[((groupOffset + na) * record_len) + valueIdx];
        }
        out_buff[accumulatedRecordOffset + valueIdx] = sum;

        valueIdx += blockDim.x;
    }
}

(The “output” buffer is a float type, as later kernels will perform further processing on this data).
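
For concreteness, the host-side launch currently looks roughly like this (d_in_buff / d_out_buff are placeholder names for the device buffers, not our actual variables):

    // Host-side launch matching the parameters above:
    FirstPassAccumulation<<<num_records_in_buffer / num_accumulations, 1024>>>(
        d_in_buff, d_out_buff, record_len, num_accumulations);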

The kernel does work, by the way - the issue here is the disproportionate increase in execution time as the number of records grows. I did think about splitting up the data, e.g. running this kernel after every (say) 1,000 records rather than waiting until we have all 20,000. At this stage I’m really after a sanity check of the code, in case there are opportunities to optimise it, before I start on more drastic refactoring.

What is the possible range of values for num_accumulations?

I don’t understand why the two loops are ordered in this fashion. I would have expected record_len threads collectively doing fully coalesced reads of the records, with thread i responsible for summing item i of consecutive records in groups of num_accumulations. With 45,000 items per record the GPU would be kept fully busy.

Perhaps some L2 cache issue? Can you use Nsight Compute to check? (Although you do not load values repeatedly, so there is little re-use for the cache to exploit.)

For coalescing, make sure that the records in in_buff are aligned, e.g. that record_len is divisible by 32 (or 64 for short). Since the data is short, also consider using short2 instead.

num_accumulations is nearly always in the single digits, e.g. 2 in this example. It can very rarely be set as high as 15 or so.

Regarding the grid/block design, my understanding is that the block size is limited to 1024 in CUDA (in a previous OpenCL implementation we did have the block size, or its equivalent, set to record_len, which simplified the kernel somewhat).
I set the grid size to (num records / num accumulations), which in the above example would be 10,000. It makes the code a little more understandable, in that one CUDA block is responsible for summing the values across a group of N records (where N=num_accumulations).
Hope that clarifies things, but apologies if my understanding of kernel design is limited.

Obviously one would use more than one thread block to process records of 45,000 items each. For example, one could run 352 blocks of 128 threads each. Recall the standard CUDA pattern for processing a 1D array:

    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] =  .... src[i];
    }

At least it is my understanding at this point that each record is essentially a 1D array of equal-sized items. If that is not the case I would suggest clarifying this aspect.

It seems to me that the code as-is has a poor memory access pattern, and it would have a close to optimal access pattern when processed in the way I suggested. It is perfectly fine for individual threads to access with a large stride when they collectively make fully coalesced accesses to memory (the “base + tid” pattern).

I see how your example relates to the record length, and you are right that each record is essentially a 1D array. I’m just not sure how to adapt/scale up that “standard pattern” in my scenario, where there are thousands of records in the buffer that need to be divided into groups of N records for accumulating.

Assuming “num_accumulations” is 2, a given thread needs to sum the i-th value from 2 neighbouring records within the buffer (where i = 0 to reclen-1). In my earlier example where there are 20,000 records in the buffer, these would be logically split up into 10,000 groups, each containing 2 records that need to be summed/accumulated.

The kernel’s grid size is currently set such that it maintains that concept of a group of records to sum, but it does mean it isn’t true “grid striding” (I’m only striding by blockDim.x), so perhaps this is my problem. What if I changed the launch parameters to something like this:

dim3 group(num_records_in_buffer / num_accumulations, 352);
kernel<<<group, 128>>>...

Here, I’ve taken your example of 352 blocks and 128 threads (for coverage of a reclen of 45,000). The 2D grid size means I maintain that earlier concept of “groups” of records to sum, while also being able to properly “grid stride” using gridDim.y * blockDim.x.
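
To illustrate the indexing I have in mind, something like this (untested sketch, reusing the names from my original kernel):

// Untested sketch of the 2D-grid variant described above:
//   gridDim.x  = number of groups (num records / num accumulations)
//   gridDim.y  = 352 blocks covering the record length
//   blockDim.x = 128 threads
__global__ void FirstPassAccumulation2D(
    short* in_buff,
    float* out_buff,
    size_t record_len,
    size_t num_accums)
{
    size_t groupIdx = blockIdx.x;                           // which group of N records to sum
    size_t groupOffset = groupIdx * num_accums;             // first record in the group
    size_t accumulatedRecordOffset = groupIdx * record_len;

    // Stride across the record using the y dimension of the grid
    size_t stride = (size_t)gridDim.y * blockDim.x;
    for (size_t valueIdx = (size_t)blockIdx.y * blockDim.x + threadIdx.x;
         valueIdx < record_len;
         valueIdx += stride)
    {
        int sum = 0;
        for (size_t na = 0; na < num_accums; na++)
            sum += in_buff[(groupOffset + na) * record_len + valueIdx];
        out_buff[accumulatedRecordOffset + valueIdx] = sum;
    }
}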

Here is pseudo code for the kernel that I am envisioning.

int tid = blockDim.x * blockIdx.x + threadIdx.x;
if (tid < recordLen) {
    int resIdx = 0;
    for (int j = 0; j < numRecords; j += numAccums) {
        int sum = 0;
        for (int jj = j; jj < (j + numAccums); jj++) {
            sum += in_record[jj].item[tid];      /// <<<<<< coalesced reads
        }
        out_record[resIdx].item[tid] = sum;      /// <<<<<< coalesced writes
        resIdx++;
    }
}

Call it:

    dim3 dimBlock(128);
    int threadBlocks = (recordLen + (dimBlock.x - 1)) / dimBlock.x;
    dim3 dimGrid(threadBlocks);
    kernel<<<dimGrid,dimBlock>>>(args ...)

The input and output streams are (except for end cases) fully coalesced, as consecutive threads read from / write to consecutive words. Code written in browser, way beyond my normal bedtime, so buyer beware.
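
In terms of the flat in_buff / out_buff layout of your original kernel, that pseudo code would translate to roughly the following (same caveats: untested, and the parameter names are mine):

// Untested translation of the pseudo code onto the flat in_buff / out_buff layout:
__global__ void AccumulateCoalesced(
    const short* in_buff,
    float* out_buff,
    size_t record_len,
    size_t num_records,
    size_t num_accums)
{
    size_t tid = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < record_len)
    {
        size_t resIdx = 0;
        for (size_t j = 0; j < num_records; j += num_accums)
        {
            int sum = 0;
            for (size_t jj = j; jj < j + num_accums; jj++)
                sum += in_buff[jj * record_len + tid];    // coalesced reads
            out_buff[resIdx * record_len + tid] = sum;    // coalesced writes
            resIdx++;
        }
    }
}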

Thanks for the suggestions/pointers - I think I see how you’ve turned the “nesting” around to improve the coalescing. I’ll digest it in full though and give it a try, then post the outcome.

If the read data is of short type, use short2 and let each thread process two data items at the same time. Nvidia GPUs are optimized for 32-bit memory access and processing per thread.
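
For example, a short2 variant of the per-element kernel sketched above might look like this, assuming record_len is even and the buffers can be reinterpreted as short2 / float2 (untested):

// Untested short2 variant of the per-element kernel above:
__global__ void AccumulateCoalescedShort2(
    const short2* in_buff,      // reinterpret_cast of the original short buffer
    float2* out_buff,           // reinterpret_cast of the original float buffer
    size_t record_len2,         // record_len / 2
    size_t num_records,
    size_t num_accums)
{
    size_t tid = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < record_len2)
    {
        size_t resIdx = 0;
        for (size_t j = 0; j < num_records; j += num_accums)
        {
            int2 sum = make_int2(0, 0);
            for (size_t jj = j; jj < j + num_accums; jj++)
            {
                short2 v = in_buff[jj * record_len2 + tid];   // one 32-bit load per thread
                sum.x += v.x;
                sum.y += v.y;
            }
            out_buff[resIdx * record_len2 + tid] = make_float2((float)sum.x, (float)sum.y);
            resIdx++;
        }
    }
}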


Mixed results after making this change, and not as good as I had perhaps hoped for.

In all tests below the reclen is 47296 and num accumulations is 2. (Other configurable aspects of the app can affect how many records end up in the buffer - sometimes dramatically, as seen in the first test).

Num records in buffer=40: kernel execution time reduced from 44us to 35us.
Num records in buffer=20,000: kernel execution time increased from 15ms to 26ms.
Num records in buffer=40,000: kernel time decreased from 414ms to 366ms.

I’d say it is high time to break out the profiler. Given that there is hardly any computation and no data re-use, performance should only be limited by memory throughput, so focus on stats related to utilization of the memory subsystem.

If the values of user-configurable parameters can vary widely, you may wind up with multiple kernels covering the various corners of the configuration space, and may need to design a heuristic or calibration run to pick the fastest kernel. If you look at something like *GEMM in CUBLAS, there are actually dozens of different kernels working under the hood.
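
Purely as an illustration, such a dispatcher could be as simple as the following; the condition and the threshold are placeholders that would have to come out of benchmarking your kernels across the relevant configurations (the kernel names refer to the sketches earlier in the thread):

// Illustrative dispatcher only; threshold and condition are made up and
// would need to be calibrated on the target GPU and configuration space.
void LaunchAccumulation(short* d_in, float* d_out,
                        size_t record_len, size_t num_records,
                        size_t num_accums, cudaStream_t stream)
{
    const size_t kRecordThreshold = 4096;   // placeholder value
    if (num_records < kRecordThreshold)
    {
        // variant A: one thread block per group of records
        unsigned groups = (unsigned)(num_records / num_accums);
        FirstPassAccumulation<<<groups, 1024, 0, stream>>>(
            d_in, d_out, record_len, num_accums);
    }
    else
    {
        // variant B: one thread per record element, coalesced across records
        unsigned block = 128;
        unsigned grid = (unsigned)((record_len + block - 1) / block);
        AccumulateCoalesced<<<grid, block, 0, stream>>>(
            d_in, d_out, record_len, num_records, num_accums);
    }
}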

Sanity check: Performance work must use release builds with full optimizations. That is what you are using, correct?

Thanks for the reply. Are you referring to Nsight “Compute” rather than “Systems” for this kind of thing? I haven’t done much with the former, so is there anything I should look out for?

I believe there is still some scope to optimise aspects of the code, e.g. in that first example we’ll be able to append many such 40 record “data sets” to the GPU buffer before processing them, to better utilise the threads.

Regarding your last point about optimisations, I’ll have to check with our software engineer, who isn’t around today, but I believe he’s been running the tests on a release build of the C++ host application. Is this what you mean, or are you referring to CUDA’s own optimisation settings? I assume this would be the nvcc optimization flags, which apparently default to -O3 in Visual Studio? It’ll probably mean more to our engineer, so I’ll ask him to look into it, but are there any particular flags we can look into?

The app is running on Windows and using a Quadro RTX 4000, if that is of any relevance.

Yes, I was thinking of Nsight Compute. I am not a profiler expert; there is a separate subforum for the profiler.

While I have a workstation with Windows 10 and a Quadro RTX 4000 here, I never use the Visual Studio IDE and don’t know what the knobs look like for CUDA builds. I build strictly via the command line where I am dealing directly with nvcc and msvc command line switches.

The device-side compiler defaults to building with full optimization, but it happens occasionally that we go through N iterations of a performance issue in a forum thread, the perf numbers don’t seem to quite add up, and we discover belatedly that the OP was running a debug build, which implies that all optimizations are disabled for device-side compilation. So best to check on that sooner rather than later.
