Multiple reductions in a 2D array using the simplest reduction example from the SDK

Hello everyone,

I'm continuing my exploration of reduction!

Now I want to do something a little special!

I have an array of floats with 4 rows and 8 columns. After the reduction, I want an array of 1 row and 8 columns containing the sum of each column. For example,

 0   1   2   3   4   5   6   7
 8   9  10  11  12  13  14  15
16  17  18  19  20  21  22  23
24  25  26  27  28  29  30  31

gives

48  52  56  60  64  68  72  76

So I launch 8 blocks of 4 threads each, and I perform a standard reduction in every block!

The code I execute on the device is the same as the Reduction #1 example:

// Kernel that executes on the CUDA device
__global__ void reduction(float *g_idata, float *g_odata) {
	extern __shared__ float sdata[];

	// each thread loads one element from global to shared mem
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
	sdata[tid] = g_idata[i];
	__syncthreads();

	// do reduction in shared mem
	for (unsigned int s = 1; s < blockDim.x; s *= 2) {
		if (tid % (2*s) == 0) {
			sdata[tid] += sdata[tid + s];
		}
		__syncthreads();
	}

	// write result for this block to global mem
	if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

This way, block 0 computes the 1st column and writes its result to g_odata[0], block 1 computes the 2nd column and writes to g_odata[1], and so on!
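Concretely, block b's four threads read the consecutive elements g_idata[4*b] through g_idata[4*b+3], so each block sums one column only if the four values of a column sit next to each other in g_idata. A quick host-side sketch of that mapping (illustration only, not part of my program):

#include <cstdio>

// Prints which g_idata element each (block, thread) pair loads,
// using i = blockIdx.x*blockDim.x + threadIdx.x with blockDim.x = 4.
int main()
{
	for (int b = 0; b < 8; ++b)           // 8 blocks, one per column
		for (int tid = 0; tid < 4; ++tid) // 4 threads, one per row
			printf("block %d, thread %d -> g_idata[%d]\n", b, tid, b * 4 + tid);
	return 0;
}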

Unfortunately it does not work, and after drawing lots of memory diagrams, I think the problem is not in the device code! (Please correct me if I'm wrong.)

So my main code is:

float * d_result;
float * d_result1;
unsigned int W = 8;
unsigned int H = 4;
float h_result[W][H];
float h_result1[W][H];
size_t pitch, pitch1;
int temp = 0;
int i = 0, j = 0;

// Matrix filling
for (i = 0; i < W; i++)
{
	for (j = 0; j < H; j++)
	{
		h_result[i][j] = temp;
		temp++;
	}
}

// Memory allocation
cudaMallocPitch((void**) &d_result, &pitch, W*sizeof(float), H);
cudaMallocPitch((void**) &d_result1, &pitch1, W*sizeof(float), H);

// Memory copying
cudaMemcpy2D(d_result, pitch, h_result, W*sizeof(float), W*sizeof(float), H, cudaMemcpyHostToDevice);

dim3 threadPerBlock(H, 1);
dim3 dimGrid(W, 1);
reduction<<<dimGrid, threadPerBlock, H*sizeof(float)>>>(d_result, d_result1);

cudaMemcpy2D(h_result1, W*sizeof(float), d_result1, pitch1, W*sizeof(float), H, cudaMemcpyDeviceToHost);

for (j = 0; j < H; j++) for (i = 1; i < W; i++) h_result[0][j] += h_result[i][j];

cout << "Expected : " << endl;
for (j = 0; j < H; j++) cout << h_result[0][j] << " ";

cout << endl << "Obtained : " << endl;
for (i = 0; i < H; i++) cout << h_result1[0][i] << " ";
cout << endl;

cudaFree(d_result);
cudaFree(d_result1);

return 0;
}

Can you see any problem in my configuration, or in the way I handle memory copying and allocation?

No ideas?

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2*s) == 0) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

I have had problems in the past with the kind of if statement you have here, due to issues with the mod operation. Please try one of the other methods suggested by NVIDIA to avoid using mod.
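For example, the strided-index variant from NVIDIA's reduction slides (reduction #2) replaces the mod with a multiply. A sketch, still assuming a power-of-two block size:

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    // consecutive tids index the active elements, no mod needed
    unsigned int index = 2 * s * tid;
    if (index < blockDim.x) {
        sdata[index] += sdata[index + s];
    }
    __syncthreads();
}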

If you have any success, please write back! David

I have tried your suggestion, but it's no better!
Do we agree that the shared memory size I pass to the kernel launch is the shared memory size PER block?
I will keep trying to understand what is going on in this code, and every suggestion is welcome!

Hello again!

The source of my problem was that cudaMallocPitch did not give me contiguous memory (I don't understand why), so I decided to linearize my h_result[W][H] array into an h_in[W*H] array in order to use cudaMalloc & cudaMemcpy instead of cudaMallocPitch & cudaMemcpy2D!

That works! So, for those who are interested, here is a multiple reduction that works:

int main()
{
	float * d_out;
	float * d_in;
	unsigned int W = 256;
	unsigned int H = 24;
	float h_in[W*H];
	float h_out[W];
	int index = 0;
	int i = 0, j = 0;

	// Matrix filling: each run of H consecutive elements holds the values 1..H
	for (i = 0; i < W; i++)
	{
		for (j = 0; j < H; j++)
		{
			h_in[index] = j + 1;
			index++;
		}
	}

	// Memory allocation
	cudaMalloc((void**)&d_in, W*H*sizeof(float));
	cudaMalloc((void**)&d_out, W*sizeof(float));

	// Memory copying
	cudaMemcpy(d_in, h_in, W*H*sizeof(float), cudaMemcpyHostToDevice);

	dim3 threadPerBlock(H, 1);
	dim3 dimGrid(W, 1);
	reduction<<<dimGrid, threadPerBlock, H*sizeof(float)>>>(d_in, d_out);

	cudaMemcpy(h_out, d_out, W*sizeof(float), cudaMemcpyDeviceToHost);

	cout << endl << "Obtained : " << endl;
	for (i = 0; i < W; i++) cout << h_out[i] << " ";
	cout << endl;

	cudaFree(d_in);
	cudaFree(d_out);

	return 0;
}

Maybe some of you can see why my 2D program was not working. Even if it will no longer be useful for my program, I would be really interested to understand what was wrong there!

I suspect that the cudaMallocPitch documentation contains the answer. That function can and will pad each allocated row for improved performance.
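Because of that padding, row r of a pitched allocation starts r*pitch bytes after the base pointer, not r*W*sizeof(float) bytes, so device code has to address it like this (a sketch following the access pattern from the cudaMallocPitch documentation; the kernel name is made up):

// Reading element (row r, column c) from a pitched allocation.
// pitch is in BYTES and may be larger than the row width because of padding.
__global__ void readPitched(const float *d_data, size_t pitch,
                            int r, int c, float *out)
{
	const float *row = (const float *)((const char *)d_data + r * pitch);
	*out = row[c];
}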

I think that you need to use column-major for h_result.

I modified your code from the first post:
(1) use column-major order for h_result
(2) use cudaMalloc and cudaMemcpy

and it works (a sketch of change (1) follows below). However, I am confused by this post: do you use column-major or row-major order?
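The column-major fill I mean looks roughly like this (a host-only sketch, not my exact code; the values match the 4x8 example from the first post):

#include <cstdio>

int main()
{
	// Element (row r, column c) is stored at h_in[c*H + r], so each column's
	// H values are contiguous and block c of the reduction kernel sums column c.
	const int H = 4, W = 8;
	float h_in[W * H];
	for (int c = 0; c < W; ++c)
		for (int r = 0; r < H; ++r)
			h_in[c * H + r] = (float)(r * W + c);   // values 0..31 as in the example

	// Host check: the column sums should be 48 52 56 60 64 68 72 76.
	for (int c = 0; c < W; ++c) {
		float sum = 0.0f;
		for (int r = 0; r < H; ++r)
			sum += h_in[c * H + r];
		printf("%g ", sum);
	}
	printf("\n");
	return 0;
}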