Hi,

I’m trying to add the rows of a 4800x9600 matrix together, resulting in a 1x9600 matrix.

What I’ve done is split the 4800x9600 matrix into 9,600 columns of 4800 elements each. I then perform a reduction on each column’s 4800 elements.

The trouble is, this is really slow…

Anyone got any suggestions?

Basically, I’m trying to implement MATLAB’s sum(…) function.

Here is the code, which I’ve verified gives correct results; it’s just really slow:

void reduceRows(Matrix Dresult, Matrix DA)
{
    // split DA into chunks, one per column
    Matrix Dchunk;
    Dchunk.h = 1; Dchunk.w = DA.h;
    cudaMalloc((void**)&Dchunk.data, Dchunk.h*Dchunk.w*sizeof(float));

    Matrix DcolSum;
    DcolSum.h = 1; DcolSum.w = 1;
    // no cudaMalloc needed: DcolSum.data points into Dresult

    int i;
    for (i = 0; i < DA.w; i++)   // loop over each column
    {
        cudaMemcpy(Dchunk.data, &DA.data[i*DA.h], DA.h*sizeof(float), cudaMemcpyDeviceToDevice);
        DcolSum.data = &Dresult.data[i];
        reduceTotal(DcolSum, Dchunk);
    }

    cudaFree(Dchunk.data);
}

int main()
{
    Matrix data;
    data.h = 4800;
    data.w = 9600;
    data.data = (float*)calloc(data.h*data.w, sizeof(float));

    Matrix Ddata;
    Ddata.h = data.h;
    Ddata.w = data.w;
    cudaMalloc((void**)&Ddata.data, Ddata.h*Ddata.w*sizeof(float));
    cudaMemcpy(Ddata.data, data.data, Ddata.h*Ddata.w*sizeof(float), cudaMemcpyHostToDevice);

    Matrix Dsum_rows;
    Dsum_rows.h = 1;
    Dsum_rows.w = data.w;
    cudaMalloc((void**)&Dsum_rows.data, Dsum_rows.h*Dsum_rows.w*sizeof(float));

    //perform reduction operation:
    reduceRows(Dsum_rows, Ddata);

    //print out Dsum_rows to verify result is correct, etc.
}

Matrix is defined as:

typedef struct {
    long w;
    long h;
    float* data;
} Matrix;

ReduceTotal() just calls the standard NVIDIA reduction: it sums all the elements in Dchunk and puts the answer in DcolSum.

I’m about to do all this on the CPU if I can’t find an answer… ;(
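In case it helps anyone compare, the CPU version of this column-major column sum really is only a few lines, which makes it a useful reference for checking the GPU result (the function name here is made up for illustration):

```c
#include <stdlib.h>

/* Sum each column of a column-major h-by-w matrix A into result[0..w-1].
   This matches what reduceRows computes: result[i] = sum of column i. */
void sumColsCPU(const float* A, float* result, long h, long w)
{
    for (long i = 0; i < w; i++) {      /* one output per column */
        float sum = 0.0f;
        for (long j = 0; j < h; j++)
            sum += A[i*h + j];          /* column i is contiguous */
        result[i] = sum;
    }
}
```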

You might find this thread of interest. It covers column summation of a column-major ordered array using a parallel reduction, but row summation is only trivially different.
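For anyone following along: the parallel reduction that thread describes sums an array pairwise in log2(n) passes. A serial C sketch of the access pattern (not the actual NVIDIA kernel, just an illustration of the idea):

```c
/* Pairwise (tree) reduction of n floats, where n is a power of two.
   Each pass halves the active length, mirroring what the threads of a
   CUDA block do in shared memory. Overwrites the input array. */
float treeReduce(float* x, int n)
{
    for (int stride = n / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++)   /* in CUDA: one thread per i */
            x[i] += x[i + stride];
    return x[0];
}
```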

Hey, thanks so much for that. Very useful. I tried the kernel posted there, but it didn’t work, so I rolled my own…

I’ve implemented the following kernel which gives the correct answer and is going at satisfactory speed:

__global__ void sumCols_kernel(float* result, float* A, int Ah, int Aw)
{
    int id = blockIdx.x;   // one block per column, one thread per block
    float sum = 0.0f;
    int i;
    for (i = 0; i < Ah; i++)
    {
        sum += A[(id*Ah) + i];
    }
    result[id] = sum;
}

Called using:

void sumCols(Matrix C, Matrix A)
{
    int numBlocks = A.w;   // one block per column
    printf("numBlocks:%d\n", numBlocks);
    sumCols_kernel<<<numBlocks, 1>>>(C.data, A.data, A.h, A.w);
    cutilCheckMsg("kernel launch failure");
}

It’s still rather naive though… Just one thread per block…

Any suggestions?

Many thanks,

My suggestion:

Make a 9600-long vector V.

Initialize it to zero.

Then, with one thread per output element:

if (tid < 9600)
    for (row = 0; row < 4800; row++)
        V[tid] = V[tid] + Matrix[row][tid]
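Assuming a row-major layout (so Matrix[row][tid] is M[row*W + tid]), that pseudocode translates to the plain C below; in CUDA each tid would be one thread, and adjacent tids would read adjacent addresses, which is what gives coalesced loads. The function name is hypothetical:

```c
/* V[tid] accumulates column tid of a row-major h-by-w matrix M.
   The outer loop over tid is what the CUDA grid would parallelize. */
void sumColsRowMajor(const float* M, float* V, int h, int w)
{
    for (int tid = 0; tid < w; tid++) {
        V[tid] = 0.0f;                    /* initialize to zero */
        for (int row = 0; row < h; row++)
            V[tid] += M[row*w + tid];     /* Matrix[row][tid] */
    }
}
```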