# CUDA Speedup

Hi.

I have a kernel which computes the sum of the elements of each column of a matrix. The kernel is as follows:

```cuda
#define BLOCKSIZE_X 64
#define BLOCKSIZE_Y 4

__global__ void Sum(const float *X, float *Y, const int N, const int K)
{
    __shared__ float x0[BLOCKSIZE_Y][BLOCKSIZE_X];

    const int idx   = threadIdx.x;
    const int idy   = threadIdx.y;
    const int colId = blockIdx.x * BLOCKSIZE_Y + idy;

    // each thread accumulates a strided partial sum of its column
    // (X is column-major: column colId starts at X + colId*N)
    x0[idy][idx] = 0.0f;
    if (colId < K) {
        const int iter = colId * N;
        for (int n = idx; n < N; n += BLOCKSIZE_X)
            x0[idy][idx] += X[iter + n];
    }
    __syncthreads();

    // fold the second warp of the row into the first
    if (BLOCKSIZE_X >= 64 && idx < 32)
        x0[idy][idx] += x0[idy][idx + 32];
    __syncthreads();

    // warp-synchronous reduction within the first warp; the volatile
    // pointer forces each step to go back through shared memory
    if (idx < 32) {
        volatile float *v = x0[idy];
        if (BLOCKSIZE_X >= 32) v[idx] += v[idx + 16];
        if (BLOCKSIZE_X >= 16) v[idx] += v[idx + 8];
        if (BLOCKSIZE_X >=  8) v[idx] += v[idx + 4];
        if (BLOCKSIZE_X >=  4) v[idx] += v[idx + 2];
        if (BLOCKSIZE_X >=  2) v[idx] += v[idx + 1];
    }

    // thread 0 of each row writes the column sum to global memory
    if (idx == 0 && colId < K)
        Y[colId] = x0[idy][0];
}
```

```cuda
const int K = 60000;
const int N = 1000;
dim3 threads(BLOCKSIZE_X, BLOCKSIZE_Y, 1);
dim3 grid((K + BLOCKSIZE_Y - 1) / BLOCKSIZE_Y, 1, 1);
Sum<<<grid, threads>>>(X, Y, N, K);
```

Any suggestions on how to speedup the above kernel would be greatly helpful! Thanks.

Hi gpu_user,

I am trying to do exactly the same operation, so I am responding partly as a “bump” to your post so that others might see it. And if all else fails, perhaps we can solve the problem together.

Can you tell me what kind of performance you are getting?

pnk

Maybe try an N that is a multiple of 16? I suspect the problem is that your reads from global memory are not coalesced.
On CUDA devices with compute capability <= 1.1, reading floats from global memory is fast only if the first thread of a half-warp reads from an address that is a multiple of 64 bytes…

I’m not sure that my idea helps you, but I think you can do better by splitting your addition into several parts. A parallel reduction over n elements needs log2(n) steps to add them all. I suggest having a look at some papers on the topic of parallel addition (reduction) of numbers.