Hi Guys,

I’m new to CUDA programming and this is my first post here.

So I hope I get everything right and don’t violate the rules.

I searched the forum on that topic but got no hits at all

(in addition, the first search always fails with an error, and then I have to

wait 20 seconds before I can submit it again).

It should be a very common problem, though, so I'm wondering why nobody

seems to have asked about it before.

Like many of us (I guess we should combine our achievements instead of

developing all this stuff over and over again), I'm trying – as a sort of first

practice in CUDA – to implement a CG solver for a sparse banded matrix.

One of the building blocks is a multiply-add method for large vectors,

i.e. for a scalar alpha and two vectors a, b, calculate

for each i:

a_i += alpha * b_i
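
In plain C on the CPU this is nothing more than a loop like the following (a minimal reference sketch; n is the vector length):

/* CPU reference (sketch): a_i += alpha * b_i for all i */
void madd_cpu(int n, float alpha, const float *b, float *a)
{
    for (int i = 0; i < n; ++i)
        a[i] += alpha * b[i];
}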

There is a function for exactly this in CUBLAS (saxpy), I know. But

today was my first try at CUBLAS, and I'm not sure how to combine it

with ordinary CUDA programming.
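
From what I've read in the CUBLAS manual, the call would go roughly like this (untested sketch, error checking omitted; a_host/b_host are just placeholder names for my host arrays):

#include <cublas.h>

/* sketch: a_i += alpha * b_i on the GPU via CUBLAS saxpy */
void madd_cublas(int n, float alpha, float *a_host, const float *b_host)
{
    float *d_a = 0, *d_b = 0;
    cublasInit();
    cublasAlloc(n, sizeof(float), (void **)&d_a);
    cublasAlloc(n, sizeof(float), (void **)&d_b);
    cublasSetVector(n, sizeof(float), a_host, 1, d_a, 1);
    cublasSetVector(n, sizeof(float), b_host, 1, d_b, 1);
    cublasSaxpy(n, alpha, d_b, 1, d_a, 1);  /* d_a = alpha * d_b + d_a */
    cublasGetVector(n, sizeof(float), d_a, 1, a_host, 1);
    cublasFree(d_a);
    cublasFree(d_b);
    cublasShutdown();
}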

The only thing I figured out so far is that you apparently cannot use NVCC

together with the CUBLAS library. Besides that, I want to learn how to write fast, working

CUDA code anyway…

My first try looks like this:

__global__ void madd_kernel(/* args */
                            int elements_per_block,
                            int elements_per_thread,
                            float alpha,
                            const float *b_vector,
                            /* input/output */
                            float *a_vector)
{
    const int B = blockIdx.x;

    /* each thread starts at its own offset inside this block's chunk */
    int base = B * elements_per_block + threadIdx.x;

    for (int slice = 0; slice < elements_per_thread; ++slice)
    {
        a_vector[base] += alpha * b_vector[base];
        base += blockDim.x;  /* jump to this thread's next element */
    }
}

I tested this and, compared to a CPU implementation,

the speed-up was about 20, for 30 iterations with two vectors of 2^24

elements each and #blocks = #threads = 512.

So my first (stupid) question is whether this is a reasonable speed-up, or whether my

implementation is bogus in the first place.

Since I use solely global memory (I can't figure out how to make use of

shared mem here), I assume that I'm already memory bound.
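
A quick back-of-the-envelope check (assuming, say, an 8800 GTX with roughly 86 GB/s peak bandwidth): each element costs two 4-byte reads and one 4-byte write, so one pass over 2^24 elements moves about 2^24 * 12 bytes ≈ 200 MB, which cannot take less than roughly 2.3 ms even at peak. If the measured kernel time is in that ballpark, the kernel really is memory bound, and shared memory wouldn't help anyway.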

I read the CUDA programming guide, and there's a section saying that a thread

can read 16 bytes (one float4) in a single instruction, so my idea for a speed-up looks like this:

__global__ void madd_kernel4(/* args */
                             int elements_per_block,
                             int elements_per_thread,
                             float alpha,
                             const float4 *b_vector,
                             /* input/output */
                             float4 *a_vector)
{
    const int B = blockIdx.x;
    int base = B * elements_per_block + threadIdx.x;

    for (int slice = 0; slice < elements_per_thread; ++slice)
    {
        float4 source = b_vector[base];  /* one 16-byte load */
        float4 dest   = a_vector[base];  /* one 16-byte load */

        dest.x += alpha * source.x;
        dest.y += alpha * source.y;
        dest.z += alpha * source.z;
        dest.w += alpha * source.w;

        a_vector[base] = dest;  /* one 16-byte store */
        base += blockDim.x;     /* jump to this thread's next float4 */
    }
}

So this should cut the number of memory instructions to a quarter (the same bytes move, just in wider 16-byte accesses).

Of course, the two versions are launched differently, since the float4 kernel

only walks over a quarter as many (four-component) elements; both calls are sketched below.
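
(Sketch, assuming d_a/d_b are the device pointers, n = 2^24 floats, and #blocks = #threads = 512:)

int n = 1 << 24;
int blocks = 512, threads = 512;

/* float version: each block covers n/blocks consecutive floats */
madd_kernel<<<blocks, threads>>>(n / blocks, n / (blocks * threads),
                                 alpha, d_b, d_a);

/* float4 version: same grid, but only n/4 elements of type float4 */
madd_kernel4<<<blocks, threads>>>((n / 4) / blocks, (n / 4) / (blocks * threads),
                                  alpha, (float4 *)d_b, (float4 *)d_a);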

My tests now show that this second kernel is about 33% slower than the first

one. Can anybody give me a hint as to what's wrong here?

Thanks a lot!

OK, I might have found the reason for the 2nd kernel being slower than

the first one. On p. 51 of the programming guide, I found a remark that (paraphrasing)

coalesced 128-bit accesses deliver noticeably lower bandwidth than coalesced 32-bit accesses.

So the float4 loads are exactly the slow case, and the "fewer, wider accesses" idea backfires here.

Well… so much for that idea.