fast vector multiply add

Hi guys,

I'm new to CUDA programming and this is my first post here, so I hope I get everything right and don't violate the rules. I searched the forum for this topic but got no hits at all (also, the first search always fails with an error and I then have to wait 20 seconds before I can submit it again). It should be a very common problem, though, so I'm wondering why nobody has asked it yet:

Like many of us (I guess we should combine our achievements instead of developing all this stuff over and over again), I'm trying, as a sort of first practice in CUDA, to implement a CG solver for a sparse banded matrix. One of the building blocks is a multiply-add routine for large vectors, i.e. for a scalar alpha and two vectors a_i, b_i, compute for each i

a_i += alpha*b_i
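
On the CPU this is just the obvious loop; a minimal sketch of roughly what my reference implementation does (n being the vector length):

void madd_cpu(int n, float alpha, const float *b, float *a)
{
    /* plain CPU reference: a[i] += alpha * b[i] */
    for (int i = 0; i < n; ++i)
        a[i] += alpha * b[i];
}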

There is a multiply-add routine in CUBLAS, I know, but today was my first try with CUBLAS and I'm not sure how to combine it with ordinary CUDA programming. The only thing I figured out so far is that I couldn't get nvcc to work together with the CUBLAS library. Besides that, I want to learn how to write fast and working CUDA code…
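
(If I read the CUBLAS docs correctly, the routine in question is saxpy, and device pointers from cudaMalloc can be handed to it directly. An untested sketch of what a call would look like with the legacy cublas.h interface:)

#include <cublas.h>   /* legacy CUBLAS interface */

/* untested sketch: y <- alpha * x + y on device vectors of length n */
void madd_cublas(int n, float alpha, const float *d_x, float *d_y)
{
    cublasInit();                          /* normally done once at startup   */
    cublasSaxpy(n, alpha, d_x, 1, d_y, 1); /* strides of 1, contiguous arrays */
    cublasShutdown();                      /* normally done once at shutdown  */
}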

My first try looks like this:

__global__ void
madd(
    /* args */
    int elements_per_block,   /* number of vector elements handled by one block  */
    int elements_per_thread,  /* number of vector elements handled by one thread */
    float alpha,
    float *b_vector,          /* input vector b                                  */
    /* input/output */
    float *a_vector)          /* a is updated in place: a[i] += alpha * b[i]     */
{
  const int B = blockIdx.x;
  const int I = threadIdx.x;
  const int num_threads = blockDim.x;

  /* first element of this block's chunk that this thread touches */
  int base = B * elements_per_block + I;

  /* consecutive threads touch consecutive elements, then everyone
     jumps ahead by the block size for the next slice */
  for (int slice = 0; slice < elements_per_thread; ++slice)
    {
      a_vector[base] += alpha * b_vector[base];
      base += num_threads;
    }
}
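
For reference, the host side just carves the vector evenly into per-block and per-thread chunks; a rough sketch of my setup and launch (the buffer names and sizes are placeholders):

void run_madd(float h_alpha)
{
    const int N                 = 1 << 24;   /* total number of floats        */
    const int num_blocks        = 512;
    const int num_threads       = 512;
    const int values_per_block  = N / num_blocks;                 /* 32768    */
    const int values_per_thread = values_per_block / num_threads; /* 64       */

    float *d_vec0, *d_vec1;
    cudaMalloc((void**)&d_vec0, N * sizeof(float));
    cudaMalloc((void**)&d_vec1, N * sizeof(float));
    /* ... fill both vectors via cudaMemcpy ... */

    madd<<<num_blocks, num_threads>>>(values_per_block, values_per_thread,
                                      h_alpha, d_vec0, d_vec1);
    cudaThreadSynchronize();   /* kernel launches are asynchronous */
}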

I tested this against a CPU implementation; the speedup was about 20x, for 30 iterations on two vectors of 2^24 elements each, with #blocks = #threads = 512.

So my first (stupid) question: is this a reasonable speedup, or is my implementation bogus in the first place?

Since I use only global memory (I can't figure out how to make use of shared memory here), I assume that I'm already memory bound.
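
One way to check that would be to time the kernel with CUDA events and compare the effective bandwidth (the kernel moves 2 reads + 1 write = 12 bytes per element) against the card's peak. A rough sketch, reusing the placeholder names from the launch sketch above:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
madd<<<num_blocks, num_threads>>>(values_per_block, values_per_thread,
                                  h_alpha, d_vec0, d_vec1);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);          /* elapsed time in milliseconds */

double bytes_moved = 3.0 * N * sizeof(float);    /* 2 loads + 1 store per element */
printf("effective bandwidth: %.1f GB/s\n", bytes_moved / 1.0e9 / (ms / 1000.0));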

I read the CUDA programming guide, and there's a section saying that 16 bytes can be read at once, so my idea for a speed-up looks like this:

__global__ void
madd2(
    /* args */
    int elements_per_block,   /* now counted in float4 elements */
    int elements_per_thread,  /* now counted in float4 elements */
    float alpha,
    float4 *b_vector,         /* input vector b, read 16 bytes at a time      */
    /* input/output */
    float4 *a_vector)         /* a is updated in place, four floats per access */
{
  const int B = blockIdx.x;
  const int I = threadIdx.x;
  const int num_threads = blockDim.x;

  int base = B * elements_per_block + I;

  for (int slice = 0; slice < elements_per_thread; ++slice)
    {
      /* one 16-byte load per vector instead of four 4-byte loads */
      float4 source = b_vector[base];
      float4 dest   = a_vector[base];
      dest.x += alpha * source.x;
      dest.y += alpha * source.y;
      dest.z += alpha * source.z;
      dest.w += alpha * source.w;
      a_vector[base] = dest;
      base += num_threads;
    }
}

So this should reduce the number of memory accesses. Of course, while for the first version I would call

madd<<<num_blocks, num_threads>>>(values_per_block, values_per_thread, h_alpha, d_vec0, d_vec1);

the second call goes like this:

madd2<<<num_blocks, num_threads>>>(values_per_block/4, values_per_thread/4, h_alpha, (float4*)d_vec0, (float4*)d_vec1);

My tests now show that this second kernel is about 33% slower than the first one. Can anybody give me a hint about what's wrong here?
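
(Just to rule out a plain correctness problem with the float4 version, a simple host-side check against the CPU result would look something like this; h_ref and h_out are placeholder names for the expected values and the copied-back GPU result:)

/* sketch: compare the copied-back GPU result against the CPU reference */
int mismatches = 0;
for (int i = 0; i < N; ++i)
    if (fabsf(h_out[i] - h_ref[i]) > 1e-5f * fabsf(h_ref[i]) + 1e-6f)
        ++mismatches;
printf("mismatches: %d of %d\n", mismatches, N);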

Thanks a lot!

OK, I might have found the reason for the 2nd kernel being slower than the first one: on p. 51 of the programming guide, there is a passage that seems to explain it... well…

But I would still like to hear about your experiences with multiply-add: how big a speedup you got, how you achieved it, etc.