I have code that was originally CPU-serial; it involves summing all elements of an array for use as a normalization factor. Breaking the problem down results in 3 or 4 kernel functions, one of which is this sum, which can be solved with the array-reduction approach.
In generic terms, the function that calls these kernels is defined as (error checking omitted for easier reading):
// Declare and allocate needed space
float *array_in, *array_out;
cudaMallocManaged(&array_in, array_len * sizeof(float));
cudaMallocManaged(&array_out, array_len * sizeof(float));
// Assume array_in is filled here
// Call the kernels
kernel_1 <<<NUM_BLOCKS, NUM_THREADS>>> (array_in, array_out, array_len);
cudaDeviceSynchronize();
reduction <<<NUM_BLOCKS, NUM_THREADS>>> (array_in, array_len);
cudaDeviceSynchronize();
kernel_2 <<<NUM_BLOCKS, NUM_THREADS>>> (array_in, array_out, array_len);
cudaDeviceSynchronize();
// Do something else with data and then free the memory
cudaFree(array_in);
cudaFree(array_out);
I will need the array-reduction result to call kernel_2(). I am now working on getting the reduction into my program; it is already tested, proven code, so there is no need to reinvent the wheel. The document I used as a base (copy/paste, if you will) is: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
Slide 35 has the final (7th) kernel implementation, which I paste here in case we need to reference specific lines of the code:
==== SLIDE 35 of CUDA REDUCTION BY MARK HARRIS ====
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n)
    {
        sdata[tid] += g_idata[i] + g_idata[i+blockSize];
        i += gridSize;
    }
    __syncthreads();
    if (blockSize >= 512)
    {
        if (tid < 256) { sdata[tid] += sdata[tid + 256]; }
        __syncthreads();
    }
    if (blockSize >= 256)
    {
        if (tid < 128) { sdata[tid] += sdata[tid + 128]; }
        __syncthreads();
    }
    if (blockSize >= 128)
    {
        if (tid < 64) { sdata[tid] += sdata[tid + 64]; }
        __syncthreads();
    }
    if (tid < 32)
    {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
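==== END OF SLIDE 35 ====

Based on the extern __shared__ declaration, my current guess is that sdata is sized by the third execution-configuration argument, so my launch attempt looks like this (the 256-thread choice and the d_in/d_partial names are mine; please correct me if the shared-memory size is wrong):

```cuda
// Guess: dynamic shared memory for sdata[] is passed as the third <<<>>> argument,
// one int per thread of the block.
reduce6<256><<<NUM_BLOCKS, 256, 256 * sizeof(int)>>>(d_in, d_partial, array_len);
```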
Then I looked at the CUDA sample code (reduction.h/.cpp) that ships with the installation, just to see the context needed for this code to run, and noticed that the sample is not quite the same implementation as the one in the referred presentation, correct? The first difference I see is that, while Mark Harris aims for a generic implementation, the CUDA sample code assumes array sizes that are powers of 2. In my case I will be using arrays with around 2^32 or 2^33 elements, but not necessarily always powers of 2, so I will stick with Mark's code, since it does not appear to impose that requirement.
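One thing I did notice: if I am reading the first loop right, g_idata[i + blockSize] can read past the end of the array when n is not a multiple of blockSize*2, so non-power-of-2 sizes seem to need a guard there. To convince myself the guarded indexing still covers every element exactly once, I modeled the per-thread accumulation serially in plain C++ (gridStrideSum and all the parameter names here are mine, not from the slides):

```cpp
#include <cassert>
#include <vector>

// Serial model of the guarded grid-stride load phase of reduce6.
// Each (block, tid) pair accumulates exactly what one CUDA thread would,
// with the second read guarded so it cannot run past the end of the input.
long long gridStrideSum(const std::vector<int>& in,
                        unsigned blockSize, unsigned numBlocks) {
    const unsigned n = (unsigned)in.size();
    const unsigned gridSize = blockSize * 2 * numBlocks;
    long long total = 0;
    for (unsigned b = 0; b < numBlocks; ++b) {
        for (unsigned tid = 0; tid < blockSize; ++tid) {
            long long acc = 0;                   // plays the role of sdata[tid]
            for (unsigned i = b * blockSize * 2 + tid; i < n; i += gridSize) {
                acc += in[i];
                if (i + blockSize < n)           // the guard the slide omits
                    acc += in[i + blockSize];
            }
            total += acc;                        // stand-in for the tree reduction
        }
    }
    return total;
}
```

For n = 1000 (not a power of two) this matches the straight sum for every launch shape I tried, which makes me reasonably confident the guard is the only change needed in the load loop.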
In this case, can you enlighten me on a few aspects, since it is not really a drop-in function?
- Where exactly is sdata coming from? What should its length be if we reduce an array of length N?
- Is sdata storing intermediate sums for the current kernel run? If so, what is g_odata storing in the end?
- In the template function definition, shouldn't blockSize be an argument along with the other ones?
- What should the reduction kernel calls look like, considering that an array of 2^30 elements will require more calls than, say, one of 4096 elements?
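To make that last question concrete, this is how I am currently counting the launches I would need (plain C++; passesNeeded is my own helper and assumes each pass uses ceil(n / (threads*2)) blocks with no cap on grid size, with each subsequent pass reducing the previous pass's per-block partials):

```cpp
#include <cassert>

// Each launch collapses n inputs into ceil(n / (blockSize*2)) per-block
// partial sums; repeat until a single value remains.
unsigned passesNeeded(unsigned long long n, unsigned blockSize) {
    unsigned passes = 0;
    while (n > 1) {
        n = (n + blockSize * 2ULL - 1) / (blockSize * 2ULL);  // partials left
        ++passes;
    }
    return passes;
}
```

With 256 threads per block this gives 2 launches for 4096 elements and 4 for 2^30, so if my arithmetic is right the big arrays only cost a couple of extra passes over the partials.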
I appreciate any guidance you can provide.