Cuda Speedup

gpu_user · October 3, 2009, 12:12am

Hi.

I have a kernel which computes the sum of the elements of each column of a matrix. The kernel is as follows:

#define BLOCKSIZE_X 64
#define BLOCKSIZE_Y 4

__global__ void Sum(const float *X, float *Y, const int N, const int K)
{
    __shared__ float x0[BLOCKSIZE_Y][BLOCKSIZE_X];

    int idx = threadIdx.x;
    int idy = threadIdx.y;

    int colId = blockIdx.x*BLOCKSIZE_Y + idy;

    if(colId < K){
        // compute start of X segment
        int iter = colId*N;

        x0[idy][idx] = 0.0;
        for (int n = idx; n < N; n += BLOCKSIZE_X){
            int j = iter + n;
            x0[idy][idx] += X[j];
        }

        __syncthreads();

        // add partial sums
        if (BLOCKSIZE_X >= 64)
            x0[idy][idx] += x0[idy][idx + 32];

        if (BLOCKSIZE_X >= 32)
            x0[idy][idx] += x0[idy][idx + 16];

        if (BLOCKSIZE_X >= 16)
            x0[idy][idx] += x0[idy][idx + 8];

        if (BLOCKSIZE_X >= 8 )
            x0[idy][idx] += x0[idy][idx + 4];

        if (BLOCKSIZE_X >= 4 )
            x0[idy][idx] += x0[idy][idx + 2];

        if (BLOCKSIZE_X >= 2 )
            x0[idy][idx] += x0[idy][idx + 1];

        // store result to global variable
        if (idx == 0){
            Y[colId] = x0[idy][0];
        }
    }
}

K = 60000;
N = 1000;
dim3 threads(BLOCKSIZE_X, BLOCKSIZE_Y, 1);
dim3 grid((K+BLOCKSIZE_Y-1)/BLOCKSIZE_Y, 1, 1);
Sum<<<grid, threads>>> (X, Y, N, K);

Any suggestions on how to speed up the above kernel would be greatly appreciated! Thanks.

pnk · October 18, 2009, 6:37pm

Hi gpu_user,

I am trying to do exactly the same operation, so I am replying partly as a “bump” to your post so that others might respond. If all else fails, perhaps we can solve the problem together.

Can you tell me what kind of performance you are getting?

pnk
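
For anyone who wants to report numbers, here is a minimal sketch of one way to measure the kernel, assuming CUDA events are used for timing. The helper names `timeKernelMs` and `effectiveBandwidthGBs` are made up for this illustration; they are not part of the original post or of the CUDA API. The bandwidth figure only counts the single read of the N x K input matrix.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: effective read bandwidth in GB/s for one pass over
// an N x K matrix of floats, given the elapsed kernel time in milliseconds.
inline double effectiveBandwidthGBs(int N, int K, float elapsedMs)
{
    const double bytes = static_cast<double>(N) * K * sizeof(float);
    return bytes / (elapsedMs * 1.0e-3) / 1.0e9;
}

// Hypothetical helper: times whatever `launch` enqueues, using CUDA events.
// `launch` is any callable that performs the kernel launch.
template <typename Launch>
float timeKernelMs(Launch launch)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    launch();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

With the launch from the first post this could be used as `float ms = timeKernelMs([&]{ Sum<<<grid, threads>>>(X, Y, N, K); });` followed by `effectiveBandwidthGBs(N, K, ms)`. Comparing that figure with the memory bandwidth of the card gives a rough idea of how much room for improvement is left.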

Julek · October 19, 2009, 12:55pm

Maybe try an N that is a multiple of 16? I guess the problem is that the reads from global memory are not coalesced.
On CUDA devices with compute capability <= 1.1, reading floats from global memory is fast only if the first thread of a half-warp reads from an address that is a multiple of 64 bytes…
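
A sketch of how the padding could be done on the host side, assuming each column is stored as N contiguous floats (which is what `colId*N` in the kernel implies). The names `paddedLength` and `uploadPadded` are hypothetical helpers invented for this example. The idea is to round the column stride up to a multiple of 16 floats (64 bytes) so that every column starts on a 64-byte boundary.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: round a column length up to the next multiple of
// 16 floats, i.e. 64 bytes.
inline int paddedLength(int n)
{
    return (n + 15) / 16 * 16;
}

// Hypothetical helper: copy K contiguous columns of N floats each to the
// device, using a padded stride between columns. Returns the device
// pointer (nullptr on failure) and writes the stride, in floats, to *ldOut.
float* uploadPadded(const float* hostX, int N, int K, int* ldOut)
{
    const int ld = paddedLength(N);
    const std::size_t totalBytes = static_cast<std::size_t>(ld) * K * sizeof(float);

    float* devX = nullptr;
    if (cudaMalloc(&devX, totalBytes) != cudaSuccess)
        return nullptr;

    // Zero the buffer so the padding elements hold a defined value.
    cudaMemset(devX, 0, totalBytes);

    // One "row" of the 2D copy is one column of the matrix.
    cudaError_t err = cudaMemcpy2D(devX, ld * sizeof(float),
                                   hostX, N * sizeof(float),
                                   N * sizeof(float), K,
                                   cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        cudaFree(devX);
        return nullptr;
    }

    *ldOut = ld;
    return devX;
}
```

For N = 1000 the padded stride is 1008 floats. The kernel would then need the stride as an extra parameter and use it instead of N when computing the start of a column (`int iter = colId*ld;`), while the loop bound stays N.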

analyzer · October 20, 2009, 6:53am

I’m not sure my idea will help you, but I guess you can do better by splitting the addition into several parts. With parallel computing, adding n elements takes log2(n) steps. I suggest having a look at some papers on the parallel addition of numbers.
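
To make the log2(n) idea concrete, here is a minimal sketch of a tree reduction in shared memory, keeping the 64 x 4 block layout from the first post. It is only one possible way to write it, not a tuned implementation. The names `columnSum` and `columnSumsOnDevice` are made up for this example. Each step halves the number of active threads, is guarded by `tx < stride`, and is followed by a barrier, so 64 partial sums are combined in log2(64) = 6 steps.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

constexpr int BLOCK_X = 64;  // threads cooperating on one column
constexpr int BLOCK_Y = 4;   // columns handled per block

// Sums each column of X (K columns of N contiguous floats) into Y[col].
__global__ void columnSum(const float* X, float* Y, int N, int K)
{
    __shared__ float partial[BLOCK_Y][BLOCK_X];

    const int tx  = threadIdx.x;
    const int ty  = threadIdx.y;
    const int col = blockIdx.x * BLOCK_Y + ty;

    // Strided partial sum. Out-of-range columns contribute zero, so every
    // thread of the block still reaches the barriers below.
    float acc = 0.0f;
    if (col < K) {
        const float* column = X + static_cast<std::size_t>(col) * N;
        for (int n = tx; n < N; n += BLOCK_X)
            acc += column[n];
    }
    partial[ty][tx] = acc;
    __syncthreads();

    // Tree reduction: log2(BLOCK_X) steps, one barrier per step.
    for (int stride = BLOCK_X / 2; stride > 0; stride >>= 1) {
        if (tx < stride)
            partial[ty][tx] += partial[ty][tx + stride];
        __syncthreads();
    }

    if (tx == 0 && col < K)
        Y[col] = partial[ty][0];
}

// Hypothetical host wrapper (error checking omitted for brevity).
std::vector<float> columnSumsOnDevice(const std::vector<float>& X, int N, int K)
{
    float* dX = nullptr;
    float* dY = nullptr;
    cudaMalloc(&dX, X.size() * sizeof(float));
    cudaMalloc(&dY, K * sizeof(float));
    cudaMemcpy(dX, X.data(), X.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 threads(BLOCK_X, BLOCK_Y, 1);
    dim3 grid((K + BLOCK_Y - 1) / BLOCK_Y, 1, 1);
    columnSum<<<grid, threads>>>(dX, dY, N, K);

    std::vector<float> Y(K);
    cudaMemcpy(Y.data(), dY, K * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dY);
    return Y;
}
```

The loop form trades a little speed for clarity compared with a fully unrolled reduction. Whether the reduction or the global memory reads dominate the run time would have to be measured.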
