Reduction & block dimension: using the simplest reduction example from the SDK

Hello everyone!

I need to do a reduction in my program.

I've read NVIDIA's documentation about it (good paper, by the way) and I'm now trying to do the same thing, and obviously it doesn't work.

I'm doing exactly the same thing as the first example in the SDK, so I assume my mistake is in the threadPerBlock and/or the dimGrid I've chosen. I've tried lots of different values but the result is always the same.

Here is my code:

#include <iostream>
#include <cuda_runtime.h>

using namespace std;

// Kernel that executes on the CUDA device
__global__ void reduction(float *result_in, float *result_out)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // each thread loads one element from global into shared memory
    sdata[tid] = result_in[i];
    __syncthreads();

    // tree-based reduction in shared memory
    for (unsigned int s = 1; s < blockDim.x; s *= 2)
    {
        if (tid % (2 * s) == 0)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write this block's result to global memory
    if (tid == 0) result_out[blockIdx.x] = sdata[0];
}

int main()
{
    float *d_result;
    float *d_result1;
    const int W = 16;
    float h_result[W];
    float h_result1[W];

    for (int i = 0; i < W; i++) h_result[i] = i;

    cudaMalloc((void**)&d_result,  W * sizeof(float));
    cudaMalloc((void**)&d_result1, W * sizeof(float));
    cudaMemcpy(d_result, h_result, W * sizeof(float), cudaMemcpyHostToDevice);

    dim3 threadPerBlock(16, 2);
    dim3 dimGrid(W / threadPerBlock.x, 1);

    reduction<<<dimGrid, threadPerBlock>>>(d_result, d_result1);

    cudaMemcpy(h_result1, d_result1, W * sizeof(float), cudaMemcpyDeviceToHost);

    // reference sum computed on the CPU
    int res = 0;
    for (int k = 0; k < W; k++) res += h_result[k];

    cout << "Res : " << res << endl;
    cout << "Res2 : " << h_result1[0] << endl;

    cudaFree(d_result);
    cudaFree(d_result1);

    return 0;
}

I've chosen a power of two, so it should work, but I get 28 instead of 120. Can somebody explain which newbie mistake I'm making?

For information:

According to your kernel,

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

I suppose you want to use a 1-D thread block and a 1-D grid, so your execution configuration is wrong.

Modify

dim3 threadPerBlock(16, 2);
dim3 dimGrid(W / threadPerBlock.x, 1);

to

dim3 threadPerBlock(16, 1);
dim3 dimGrid(W / threadPerBlock.x, 1);

Second, you use dynamic shared memory in your kernel,

extern __shared__ float sdata[];

but you don't specify how many bytes you need in the execution configuration. Modify

reduction<<<dimGrid, threadPerBlock>>>(d_result, d_result1);

to

reduction<<<dimGrid, threadPerBlock, 16 * sizeof(float)>>>(d_result, d_result1);
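Putting both modifications together, the host side would look roughly like this (a minimal sketch using the same names as above, with W = 16 as in the original post):

dim3 threadPerBlock(16, 1);                 // 1-D block: 16 threads
dim3 dimGrid(W / threadPerBlock.x, 1);      // 1-D grid: one block for W == 16
// third launch parameter = bytes of dynamic shared memory per block,
// one float per thread for sdata[]
reduction<<<dimGrid, threadPerBlock, threadPerBlock.x * sizeof(float)>>>(d_result, d_result1);

With these changes the kernel sums 0 + 1 + ... + 15, so Res and Res2 should both print 120.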

OK, thank you very much!
I was actually using threadPerBlock = (16, 1); I just made a mistake when I copied the line over.
But you're right, my problem was that I didn't specify how many bytes of shared memory I needed.
Thanks!

I also need to do a reduction, specifically the mean of an array's elements. I checked the reduction example in the SDK, but it seems hard to understand, so I used a cascaded sum instead. This code works correctly in emulation mode, but on the GPU it doesn't: the results are wrong (huge numbers, NaNs, or otherwise incorrect means), and it sometimes produces artefacts on my monitor (moving snowflakes), as if I had corrupted some GPU memory while running it. Can anyone tell me where the problem is?

__global__ void __meanStage1(float *A, float *Sum, long N, long M)
{
    // linear index of this block in the grid: one partial sum per block
    long iS = blockIdx.y * gridDim.x + blockIdx.x;

    // accumulate one BLOCK_SIZE x BLOCK_SIZE tile into Sum[iS]
    for (long j = blockIdx.y * BLOCK_SIZE; j < (blockIdx.y + 1) * BLOCK_SIZE; j++)
        for (long i = blockIdx.x * BLOCK_SIZE; i < (blockIdx.x + 1) * BLOCK_SIZE; i++)
            if (i < N && j < M)
            {
                Sum[iS] = Sum[iS] + A[j * M + i];
            }

    __syncthreads();
}

__global__ void __meanStage2(float *a, float *mean, float *sum, long dX, long dY, long N, long M)
{
    // add up the per-block partial sums and divide by the number of elements
    *mean = 0;
    for (long i = 0; i < dX * dY; i++)
        *mean = *mean + sum[i];

    if (M * N != 0)
        *mean = (*mean) / (N * M);

    __syncthreads();
}

void means(float *a, float &Means, long N, long M)
{
    float *sum, *mean;
    long iS, jS;

    iS = N / BLOCK_SIZE;
    jS = M / BLOCK_SIZE;

    cudaMalloc((void**) &mean, sizeof(float));
    cudaMalloc((void**) &sum, iS * jS * sizeof(float));
    cudaMemset(sum, 0, iS * jS * sizeof(float));

    dim3 dimBlock1(1, 1);
    dim3 dimGrid1(N / BLOCK_SIZE, M / BLOCK_SIZE);
    __meanStage1<<<dimGrid1, dimBlock1>>>(a, sum, N, M);

    dim3 dimBlock2(1, 1);
    dim3 dimGrid2(1, 1);
    __meanStage2<<<dimGrid2, dimBlock2>>>(a, mean, sum, iS, jS, N, M);

    cudaFree(sum);
    cudaMemcpy(&Means, mean, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(mean);
}

The indexing scheme in __meanStage1() looks completely wrong. Every thread in any block will have the same iS value, which will cause a memory race: every thread is simultaneously trying to write to the same entry in Sum. I am not surprised the results are rubbish.
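For illustration only, here is a hypothetical sketch (not from the original post) of how the per-tile accumulation could be made race-free if the kernel were launched with BLOCK_SIZE x BLOCK_SIZE threads per block; it assumes atomicAdd on float, which requires compute capability 2.0 or later, and keeps the original A[j * M + i] indexing:

// Hypothetical sketch: one thread per element, assuming
// dim3 dimBlock1(BLOCK_SIZE, BLOCK_SIZE) in the launch.
__global__ void meanStage1_threads(const float *A, float *Sum, long N, long M)
{
    long iS = blockIdx.y * gridDim.x + blockIdx.x;   // one partial-sum slot per block
    long i  = blockIdx.x * blockDim.x + threadIdx.x;
    long j  = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < M)
        atomicAdd(&Sum[iS], A[j * M + i]);           // serializes the concurrent writes to Sum[iS]
}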

Yes, but I launch only one thread in each block, and N/16 x M/16 blocks in the grid:

dim3 dimBlock1(1, 1);
dim3 dimGrid1(N/BLOCK_SIZE, M/BLOCK_SIZE);
__meanStage1<<<dimGrid1, dimBlock1>>>(a, sum, N, M);

and I thought that this formula

long iS = blockIdx.y * gridDim.x + blockIdx.x;

would give a different iS for each block (the absolute address of the block in the grid), where there is only one thread.

Now I have found the presentation on fast reduction by Mark Harris and took the fourth method from it, but in GPU mode it gives a segmentation fault.

__global__ void reduce(float *g_idata, float *g_odata)
{
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // each thread loads and adds two elements from global memory
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // tree-based reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// float *a     - input matrix, already in GPU memory
// float &Means - mean, returned to host memory
void meanh(float *a, float &Means, long N, long M)
{
    float *sum;
    cudaMalloc((void**) &sum, N * M * sizeof(float));
    cudaMemcpy(sum, a, N * M * sizeof(float), cudaMemcpyDeviceToDevice);

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(N / BLOCK_SIZE, M / BLOCK_SIZE);
    reduce<<<dimGrid, dimBlock>>>(sum, sum);

    cudaMemcpy(&Means, sum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(sum);

    Means = Means / (N * M);
}
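As with the first kernel in this thread, reduce declares an extern __shared__ array (here as int, although the data is float), but the launch in meanh passes no dynamic shared-memory size. Below is a minimal sketch of a launch that supplies that size, assuming a 1-D block since the kernel only indexes threads with threadIdx.x; it addresses only the shared-memory configuration, so the in-place use of sum as both input and output and the single reduction pass would still need attention:

// Sketch only: 1-D block to match the kernel's threadIdx.x indexing,
// plus one float of dynamic shared memory per thread for sdata[]
// (sdata should also be declared float rather than int in the kernel).
dim3 dimBlock(BLOCK_SIZE * BLOCK_SIZE);          // e.g. 256 threads if BLOCK_SIZE is 16
dim3 dimGrid((N * M) / (2 * dimBlock.x));        // each thread sums two input elements
size_t shmemBytes = dimBlock.x * sizeof(float);  // bytes of dynamic shared memory per block
reduce<<<dimGrid, dimBlock, shmemBytes>>>(sum, sum);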