Problems with Reduction Kernel

Hello,

I found this reduction example in an official NVIDIA PDF:

__global__ void reduce0(int *g_idata, int *g_odata) {
	extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

	sdata[tid] = g_idata[i];
	__syncthreads();
// do reduction in shared mem

	for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
		if (tid < s) {
			if (sdata[tid] < sdata[tid + s]) {
				sdata[tid] = sdata[tid + s];
			}

		}
		__syncthreads();
	}

        // write result for this block to global mem
	if (tid == 0)
		g_odata[blockIdx.x] = sdata[0];
}

My problem now is that I don't have int values to compare; I have a two-dimensional array of float values.

But if I try to change

extern __shared__ int sdata[];

to

extern __shared__ int sdata[][];

I immediately run into my first problem. The compiler complains:

error: an array may not have elements of this type

In other words I would like to build something like this one:

__global__ void reduce0(float g_idata[PARTICLES][DIMENSIONS], float g_odata[DIMENSIONS]) {
	extern __shared__ float sdata[][];

	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	//printf("blockDim.x = %d
", blockDim.x);
	if (i < PARTICLES) {
		for (int d = 0; d < DIMENSIONS; d++)
			sdata[tid][d] = g_idata[tid][d];

		__syncthreads();

		//do reduction in shared mem
		for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
			if (tid < s) {
				if (getFitness(sdata[tid + s]) < getFitness(sdata[tid])) {
					for (int d = 0; d < DIMENSIONS; d++)
						sdata[tid][d] = sdata[tid + s][d];
				}
			}
			__syncthreads();
		}

	}
        // write result for this block to global mem
	if (tid == 0)
		//g_odata[blockIdx.x] = sdata[0];
		for (int d = 0; d < DIMENSIONS; d++)
			g_odata[d] = sdata[0][d];
}

But this doesn’t work because of the described error.

So how can I solve this problem?

Regards
sw

You can only define the shared memory as a 1D array if it is extern. You can retrieve the elements by changing the index from [i][j] to [i + j*lx]. If you want a 2D shared array, remove the extern attribute and declare the array as static.
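
For example, a minimal sketch of that mapping (purely illustrative: flattenExample is a made-up name, lx is the leading dimension of the logical 2D array, DIMENSIONS is taken from your kernel above, and a single block is assumed to keep the indexing simple):

__global__ void flattenExample(const float *g_in, int lx)
{
	// dynamic shared memory can only be declared as a 1D extern array
	extern __shared__ float sdata[];

	int i = threadIdx.x;
	for (int j = 0; j < DIMENSIONS; j++)
		// what would be sdata[i][j] with a 2D array becomes one flattened index
		sdata[i + j * lx] = g_in[i + j * lx];
	__syncthreads();
	// ... reduction on the flattened layout ...
}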

Hello,

declaring the variable as static leads to this compile error:

error: "static" is not allowed within a __device__ or __global__ function

I’ve got a problem with mapping [i][j] to [i + j*lx] because then my if-condition would look like this:

if (getFitness(sdata[tid * DIMENSIONS + d + s]) < getFitness(sdata[tid * DIMENSIONS + d]))

but then getFitness gets a single float as its parameter, while it expects a float array. getFitness() is defined this way:

__device__ __host__ float getFitness(float position[])
{
  int i;
  float root;
  float squaresum = 0;
  for (i = 0; i < 3; i++) {
    squaresum += pow(((1000 - 0 / 2) - position[i]),2);
  }
  root = (sqrt(squaresum));
  return root;
}

So what can I do?

Declaring shared memory statically does not mean you need to use the keyword static.
It only means that the array is declared inside the kernel with a size that is known at compile time.

Please remove static and use a declaration something like __shared__ float sdata[M][N].
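
For instance, a minimal sketch along those lines, reusing PARTICLES, DIMENSIONS and getFitness() from your post. BLOCK_SIZE is an assumption of mine and must match blockDim.x at launch; like the original int example, this assumes every thread loads a valid particle, and the final write matches your code (a single block producing the result):

#define BLOCK_SIZE 128	// assumed compile-time block size

__global__ void reduce0(float g_idata[PARTICLES][DIMENSIONS], float g_odata[DIMENSIONS]) {
	// statically sized shared memory: no extern, both dimensions are compile-time constants
	__shared__ float sdata[BLOCK_SIZE][DIMENSIONS];

	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

	// each thread loads one particle (one row of DIMENSIONS floats) into shared mem
	for (int d = 0; d < DIMENSIONS; d++)
		sdata[tid][d] = g_idata[i][d];
	__syncthreads();

	// keep the particle with the smaller fitness in the lower half
	for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
		if (tid < s) {
			// sdata[tid] decays to float*, so it can be passed to getFitness(float position[]) directly
			if (getFitness(sdata[tid + s]) < getFitness(sdata[tid]))
				for (int d = 0; d < DIMENSIONS; d++)
					sdata[tid][d] = sdata[tid + s][d];
		}
		__syncthreads();
	}

	// write this block's best particle to global mem
	if (tid == 0)
		for (int d = 0; d < DIMENSIONS; d++)
			g_odata[d] = sdata[0][d];
}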

I hope it helps.

Thanks for clearing that up. M and N have to be known at compile time. I usually use #define M 32 or just put the number in.

Ok, thanks.

I read on the Stack Overflow forum that you use extern when allocating shared memory dynamically, and you leave extern out if you want to allocate it statically. So you need extern for dynamically allocated shared memory because you cannot give the array its size directly inside the kernel function? Am I right?

Regards
sw

Yes, you are right.
With dynamic allocation you only specify the size of the shared memory to be used in the kernel from the host, at launch time.
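
For example, with a dynamically allocated (extern __shared__) version of reduce0, the launch passes the byte size as the third launch-configuration parameter. This is just a sketch: threads is an assumed block size and d_idata/d_odata are assumed device pointers.

int threads = 128;
int blocks = (PARTICLES + threads - 1) / threads;
size_t smemBytes = threads * DIMENSIONS * sizeof(float);	// one row of DIMENSIONS floats per thread
reduce0<<<blocks, threads, smemBytes>>>(d_idata, d_odata);	// third parameter = dynamic shared memory bytes per block
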
But I would prefer to use static shared memory over dynamic.

Also, for static shared memory you can pass the size as a template parameter, e.g.

template<typename T, int SMEM_SIZE> __global__
void kernel(...)
{
    __shared__ T smem[SMEM_SIZE];

    ... 
}
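
A hypothetical call then fixes the shared-memory size at compile time through the template argument, so no size needs to be passed at launch:

kernel<float, 256><<<blocks, threads>>>(/* kernel arguments */);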