Hi guys,

I'm new to CUDA programming and this is my first post here, so I hope I get everything right and don't violate the rules. I searched the forum on this topic but got no hits at all (besides, the first search always fails with an error and then I have to wait 20 seconds before I can submit it again). It should be a very common problem, though, so I'm wondering why nobody has asked this:

Like many of us (I guess we should combine our achievements instead of developing all this stuff over and over again), I'm trying, as a sort of first practice in CUDA, to implement a CG solver for a sparse banded matrix. One of the building blocks is a multiply-add operation for large vectors, i.e. for a scalar alpha and two vectors a_i, b_i, calculate for each i

a_i += alpha*b_i
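For checking the GPU results, this is the kind of CPU reference I compare against — a minimal sketch (the function name is mine, not from any library):

```c
/* CPU reference for a_i += alpha * b_i (the classic saxpy operation);
   useful for verifying the GPU kernel's output element by element. */
void madd_cpu(int n, float alpha, const float *b, float *a)
{
    for (int i = 0; i < n; ++i)
        a[i] += alpha * b[i];
}
```

Comparing the GPU output against this on a few thousand elements catches indexing bugs early, before timing anything.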

I know there is a madd function in CUBLAS (cublasSaxpy does exactly this operation), but today was my first try at CUBLAS and I'm not sure how to combine it with ordinary CUDA programming. The only thing I figured out so far is that I couldn't get NVCC to work together with the CUBLAS library. Besides that, I want to learn how to write fast, working CUDA code myself...

My first try looks like this:

```
__global__ void
madd(/* input */
     int elements_per_block,
     int elements_per_thread,
     float alpha,
     float *b_vector,
     /* input/output */
     float *a_vector)
{
    const int B = blockIdx.x;
    const int I = threadIdx.x;
    const int num_threads = blockDim.x;
    int base = B * elements_per_block + I;
    /* each thread handles elements_per_thread elements, strided by blockDim.x */
    for (int slice = 0; slice < elements_per_thread; ++slice)
    {
        a_vector[base] += alpha * b_vector[base];
        base += num_threads;
    }
}
```

I tested this against a CPU implementation, and the speedup was about 20x for 30 iterations with two vectors of 2^24 elements each and #blocks = #threads = 512.

So my first (maybe stupid) question is whether this is a reasonable speedup, or whether my implementation is bogus in the first place.

Since I use only global memory (I can't figure out how to make use of shared memory here), I assume that I'm already memory bound.

I read in the CUDA programming guide that there's a section saying 16 bytes can be read in a single instruction (e.g. as a float4), so my idea for a speedup looks like this:

```
__global__ void
madd2(/* input */
      int elements_per_block,
      int elements_per_thread,
      float alpha,
      float4 *b_vector,
      /* input/output */
      float4 *a_vector)
{
    const int B = blockIdx.x;
    const int I = threadIdx.x;
    const int num_threads = blockDim.x;
    int base = B * elements_per_block + I;
    for (int slice = 0; slice < elements_per_thread; ++slice)
    {
        /* one 16-byte load per vector instead of four 4-byte loads */
        float4 source = b_vector[base];
        float4 dest = a_vector[base];
        dest.x += alpha * source.x;
        dest.y += alpha * source.y;
        dest.z += alpha * source.z;
        dest.w += alpha * source.w;
        a_vector[base] = dest; /* result goes back into a_vector, as in madd */
        base += num_threads;
    }
}
```

So this should reduce the number of separate memory accesses. Of course, while for the first version I call

```
madd<<<num_blocks, num_threads>>>(values_per_block, values_per_thread, h_alpha, d_vec0, d_vec1);
```

the second call goes like this:

```
madd2<<<num_blocks, num_threads>>>(values_per_block/4, values_per_thread/4, h_alpha, (float4*)d_vec0, (float4*)d_vec1);
```

My tests now show that this second kernel is about 33% slower than the first one. Can anybody give me a hint what's wrong here?

Thanks a lot!