I’m after some more guidance if possible please. I’ve posted about other aspects of this app recently, but this issue relates to a different kernel (so reading my other posts isn’t needed!).
The application transfers a series of “records” into GPU memory; the records arrive from an external data acquisition PCI card. Once there are enough records in the buffer, we run a series of kernels to process that data.
The number of values in each record (the “record length”) is user configurable but typically around 45,000; the values are short types. The number of records transferred into the buffer before being processed is also user configurable. The process then repeats, i.e. transfer X records, process, transfer, process, etc., often for many minutes.
The problem is this: if we transfer 20,000 records to the GPU before processing, the kernel in question (which I’ll come to below) takes 14ms to execute. If we double the number of records to 40,000 but keep all other settings the same, then the kernel execution time jumps to over 400ms.
The kernel in question is responsible for summing the records (“accumulating” is the terminology we use). The number of records to accumulate is user configurable but almost always in single digits. E.g. say we have 20,000 records in the buffer and ‘num accumulations’ = 2. The kernel will therefore sum the first two records in the buffer together, the next two records together, and so on. The summed values are written to an “output” buffer, with the end result being 10,000 new “accumulated” records.
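To make the layout concrete, here’s a minimal CPU reference of the same operation (illustrative only, not our actual code; it assumes records sit back to back in the buffer, so record r, sample v is at in[r * record_len + v], which matches the kernel indexing below):

    // Illustrative CPU reference of the accumulation (not the real implementation)
    void AccumulateReference(const short* in, float* out,
                             size_t record_len, size_t num_records,
                             size_t num_accums)
    {
        size_t num_groups = num_records / num_accums;
        for (size_t g = 0; g < num_groups; g++)          // each output ("accumulated") record
            for (size_t v = 0; v < record_len; v++)      // each sample position
            {
                int sum = 0;
                for (size_t na = 0; na < num_accums; na++)
                    sum += in[(g * num_accums + na) * record_len + v];
                out[g * record_len + v] = (float)sum;
            }
    }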
I’m assuming the poor performance can be attributed to the way a thread has to make large strides across the buffer as it sums its designated value from each record, negating any benefits of memory coalescing.
Here’s the kernel code. The launch parameters are:
<<<number of records in buffer / num accumulations, 1024>>>
__global__ void FirstPassAccumulation(
    const short* in_buff,
    float* out_buff,
    size_t record_len,
    size_t num_accums)
{
    // Which "group" of N neighbouring records to sum (where N = num_accums)
    size_t groupIdx = blockIdx.x;
    // Value to sum (ultimately covering the range 0 to record_len - 1
    // via the strided "while" loop below)
    size_t valueIdx = threadIdx.x;

    // size_t arithmetic throughout, to keep large-buffer indexing safe from int overflow
    size_t groupOffset = groupIdx * num_accums;
    size_t accumulatedRecordOffset = groupIdx * record_len;

    while (valueIdx < record_len)
    {
        int sum = 0;  // exact: single-digit accumulations of shorts fit easily in int
        for (size_t na = 0; na < num_accums; na++)
        {
            sum += in_buff[((groupOffset + na) * record_len) + valueIdx];
        }
        out_buff[accumulatedRecordOffset + valueIdx] = sum;
        valueIdx += blockDim.x;
    }
}
(The “output” buffer is a float type, as later kernels will perform further processing on this data.)
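To make the launch concrete, here’s a host-side sketch with cudaEvent timing of the sort that produces per-kernel numbers like those above (simplified, error checking omitted; d_in, d_out, numRecords, numAccums and recordLen are placeholder names for our actual variables):

    // Host-side sketch: launch plus per-kernel timing
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    FirstPassAccumulation<<<numRecords / numAccums, 1024>>>(
        d_in, d_out, recordLen, numAccums);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // ~14ms at 20,000 records, 400ms+ at 40,000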
The kernel does work, by the way; the issue is the disproportionate, worse-than-linear increase in execution time as the number of records grows. I did think about splitting up the data, e.g. running this kernel after every (say) 1,000 records rather than waiting until we have all 20,000. At this stage I’m really after a sanity check of the code, in case there are opportunities to optimise it, before I start on more drastic refactoring.
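That said, one refactor I have sketched, but not yet compiled or benchmarked (so treat it as an idea, not a fix), flattens the output into a single 1D index space and walks it with a grid-stride loop, so consecutive threads always read consecutive samples regardless of num_accums:

    // Untested sketch: one thread per output element via a grid-stride loop
    __global__ void FirstPassAccumulationFlat(
        const short* in_buff,
        float* out_buff,
        size_t record_len,
        size_t num_accums,
        size_t num_groups)   // = number of records in buffer / num_accums
    {
        size_t total = num_groups * record_len;
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
             i < total;
             i += stride)
        {
            size_t group = i / record_len;       // which accumulated record
            size_t valueIdx = i % record_len;    // which sample within it
            size_t base = (group * num_accums * record_len) + valueIdx;

            int sum = 0;
            for (size_t na = 0; na < num_accums; na++)
            {
                sum += in_buff[base + (na * record_len)];
            }
            out_buff[i] = sum;
        }
    }

Being grid-stride, it can be launched as e.g. <<<1024, 256>>> regardless of buffer size, and consecutive threads read consecutive short samples within a group, which should coalesce. Whether it actually beats the original I can’t say yet, hence the sanity-check request.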