Add Rows of a Matrix: matrix row addition incredibly slow...

Hi,

I’m trying to add the rows of a 4800x9600 matrix together, resulting in a 1x9600 matrix.

What I’ve done is split the 4800x9600 matrix into 9,600 columns of 4800 elements each. I then perform a reduction on each column’s 4800 elements.

The trouble is, this is really slow…

Anyone got any suggestions?

Basically, I’m trying to implement MATLAB’s sum(…) function.
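(For reference, here is a plain CPU version of what I’m after, given the column-major storage I’m using. This is just a sketch for checking results, not part of the real code:)

//CPU reference for MATLAB's sum(A): one sum per column of a
//column-major h-by-w array; only used to verify the GPU result
void sumColsCPU(float* result, const float* A, long h, long w)
{
		long r, c;
		for(c=0; c<w; c++)
		{
				float s=0.0f;
				for(r=0; r<h; r++)
						s+=A[c*h + r];   //element (r,c) lives at A[c*h + r]
				result[c]=s;
		}
}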

Here is the code, which I’ve verified works correctly; it’s just really slow:

void reduceRows(Matrix Dresult,Matrix DA)
{
		//split DA into chunks, one per column
		Matrix Dchunk;
		Dchunk.h=1; Dchunk.w=DA.h;
		cudaMalloc((void**)&Dchunk.data,Dchunk.h*Dchunk.w*sizeof(float));

		Matrix DcolSum;
		DcolSum.h=1; DcolSum.w=1;
		//DcolSum.data points into Dresult, so it needs no allocation of its own:
		//cudaMalloc((void**)&DcolSum.data,DcolSum.h*DcolSum.w*sizeof(float));

		int i;
		for(i=0;i<DA.w;i++)   //loop over each column
		{
				//printf("%d ",i);
				cudaMemcpy(Dchunk.data,&DA.data[i*DA.h],DA.h*sizeof(float),cudaMemcpyDeviceToDevice);
				DcolSum.data=&Dresult.data[i];
				reduceTotal(DcolSum,Dchunk);
		}

		cudaFree(Dchunk.data);
}

int main()
{
	Matrix data;
	data.h=4800;
	data.w=9600;
	data.data=(float*)calloc(data.h*data.w,sizeof(float));

	Matrix Ddata;
	Ddata.h=data.h;
	Ddata.w=data.w;
	cudaMalloc((void**)&Ddata.data,Ddata.h*Ddata.w*sizeof(float));

	Matrix Dsum_rows;
	Dsum_rows.h=1;
	Dsum_rows.w=data.w;
	cudaMalloc((void**)&Dsum_rows.data,Dsum_rows.h*Dsum_rows.w*sizeof(float));

	readCSV("./data.csv",data);   //function to read data from file (function not shown)

	//copy the host data onto the device before reducing
	cudaMemcpy(Ddata.data,data.data,Ddata.h*Ddata.w*sizeof(float),cudaMemcpyHostToDevice);

	//perform reduction operation:
	reduceRows(Dsum_rows,Ddata);

	//print out Dsum_rows to verify result is correct, etc.
	return 0;
}

Matrix is defined as:

typedef struct{
		long w;
		long h;
		float* data;
}Matrix;

reduceTotal() just calls the standard NVIDIA reduction; it sums all the elements in Dchunk and puts the answer in DcolSum.
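(reduceTotal’s internals aren’t shown here; it’s essentially the shared-memory tree reduction from the SDK’s reduction sample. A minimal sketch of that pattern, with names of my own choosing rather than the SDK’s exact code:)

//Each block reduces 2*blockDim.x elements into one partial sum; launch as
//reduce_kernel<<<blocks,threads,threads*sizeof(float)>>>(out,in,n)
//and re-run on the partial sums until a single value remains
__global__ void reduce_kernel(float* out,const float* in,int n)
{
		extern __shared__ float sdata[];
		unsigned int tid=threadIdx.x;
		unsigned int i=blockIdx.x*blockDim.x*2+threadIdx.x;

		//each thread loads two elements, adding them on the way in
		float mySum=(i<n) ? in[i] : 0.0f;
		if(i+blockDim.x<n) mySum+=in[i+blockDim.x];
		sdata[tid]=mySum;
		__syncthreads();

		//tree reduction in shared memory (blockDim.x must be a power of two)
		for(unsigned int s=blockDim.x/2;s>0;s>>=1)
		{
				if(tid<s) sdata[tid]+=sdata[tid+s];
				__syncthreads();
		}

		if(tid==0) out[blockIdx.x]=sdata[0];   //thread 0 writes the block's partial sum
}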

I’m about to do all this on the CPU if I can’t find an answer… ;(

Many thanks in advance,

You might find this thread of interest. It covers column summation of a column-major ordered array using a parallel reduction, but row summation is only trivially different.

Hey, thanks so much for that. Very useful. I tried the kernel posted there, but it didn’t work, so I rolled my own…

I’ve implemented the following kernel which gives the correct answer and is going at satisfactory speed:

__global__ void sumCols_kernel(float* result,float* A,int Ah,int Aw)
{
		//flatten the 2D block/grid indices into one global thread id,
		//which serves as the column index
		int id=threadIdx.x+blockDim.x*threadIdx.y+(blockIdx.x*blockDim.x*blockDim.y)+(blockIdx.y*blockDim.x*blockDim.y*gridDim.x);
		float sum=0.0f;
		int i;
		for(i=0;i<Ah;i++)
		{
				sum+=A[(id*Ah)+i];   //walk down column id (column-major storage)
		}
		result[id]=sum;
}

Called using:

void sumCols(Matrix C,Matrix A)
{
		int numBlocks=A.w;
		int numThreads=1;
		printf("numBlocks:%d\n",numBlocks);
		sumCols_kernel<<<numBlocks,numThreads>>>(C.data,A.data,A.h,A.w);
		cutilCheckMsg("kernel launch failure");
}

It’s still rather naive though… Just one thread per block…

Any suggestions?

Many thanks,

My suggestion:

Make a 9600-element vector V.

Initialize it to zero.

if (tid < 9600)
		for (row = 0; row < 4800; row++)
				V[tid] = V[tid] + Matrix[row, tid]

When you are done, V contains your sums.

Since your matrix is 9600 columns wide, make it 192 threads per block and 50 blocks; the loop stays inside one kernel. With just 1 flop for every twelve bytes read/written this will not be spectacularly fast, but it should be OK. Essentially what you are doing is a saxpy with alpha = 1, so you might take that into account and use CUBLAS’s saxpy: each row is a vector Mr, and you are doing V = 1*Mr + V (y = ax + y) 4,800 times.
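A sketch of what I mean, keeping the column-major layout from your code above (so element (row, col) sits at A[col*Ah + row]):

__global__ void sumRows_kernel(float* V,const float* A,int Ah,int Aw)
{
		int tid=blockIdx.x*blockDim.x+threadIdx.x;   //one thread per column
		if(tid<Aw)
		{
				float sum=0.0f;
				int row;
				for(row=0;row<Ah;row++)
						sum+=A[tid*Ah+row];   //V[tid] accumulates column tid
				V[tid]=sum;
		}
}

//9600 columns = 192 threads per block * 50 blocks
sumRows_kernel<<<50,192>>>(Dsum_rows.data,Ddata.data,4800,9600);

I’ve accumulated in a register here rather than re-reading V[tid] each iteration; same idea, fewer memory transactions.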
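And if you go the CUBLAS route, something along these lines with the legacy cublas.h API (a sketch; zero V first, since saxpy accumulates into it):

#include <cublas.h>

cublasInit();
cudaMemset(Dsum_rows.data,0,9600*sizeof(float));   //V = 0
//in column-major storage, row r is the strided vector
//{A[r], A[4800+r], A[2*4800+r], ...}: 9600 elements, stride 4800
int r;
for(r=0;r<4800;r++)
		cublasSaxpy(9600,1.0f,Ddata.data+r,4800,Dsum_rows.data,1);   //V = 1*Mr + V
cublasShutdown();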