Hi guys,
I'm new to CUDA programming and this is my first post here, so I hope I get everything right and don't violate the rules. I searched the forum for this topic but got no hits at all (also, the first search always fails with an error and then I have to wait 20 seconds before I can submit it again). It should be a very common problem, though, so I'm wondering why nobody has asked this:
Like many of us (I guess we should combine our achievements instead of developing all this stuff over and over again), I'm trying, as a sort of first practice in CUDA, to implement a CG solver for a sparse banded matrix. One of the building blocks is a multiply-add operation for large vectors, i.e. for a scalar alpha and two vectors a and b, calculate for each i:

    a_i += alpha * b_i
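On the CPU this is just a one-line loop, which is also what my reference implementation for the comparison below boils down to:

    for (int i = 0; i < n; ++i)
        a[i] += alpha * b[i];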
I know CUBLAS covers exactly this operation (the axpy family; cublasSaxpy for single precision), but today was my first try at CUBLAS and I'm not sure how to combine it with ordinary CUDA programming. The only thing I figured out was that you cannot use NVCC with the CUBLAS library. Besides that, I want to learn how to write fast, working CUDA code myself…
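For reference, I think the CUBLAS version would look roughly like this (an untested sketch against the legacy cublas.h API; d_vec0, d_vec1 and h_alpha are the same quantities as in my kernel calls below, and n is the total element count, 2^24 in my tests):

    #include <cublas.h>

    cublasInit();                                   /* set up the CUBLAS context           */
    cublasSaxpy(n, h_alpha, d_vec0, 1, d_vec1, 1);  /* d_vec1 := h_alpha*d_vec0 + d_vec1   */
    cublasShutdown();                               /* release the context when done       */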
My first try looks like this:
__global__ void
madd(/* args */
     int elements_per_block,
     int elements_per_thread,
     float alpha,
     float *b_vector,
     /* input/output */
     float *a_vector)
{
    const int B = blockIdx.x;            // block index
    const int I = threadIdx.x;           // thread index within the block
    const int num_threads = blockDim.x;  // threads per block

    // first element handled by this thread
    int base = B*elements_per_block + I;

    // each thread handles elements_per_thread elements, strided by
    // num_threads so that neighbouring threads access neighbouring words
    for (int slice = 0; slice < elements_per_thread; ++slice)
    {
        a_vector[base] += alpha*b_vector[base];
        base += num_threads;
    }
}
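Host-side, my test setup looks about like this (a simplified sketch: error checking is omitted, and h_vec0/h_vec1 are placeholder names for the host arrays):

    const int N = 1 << 24;                 /* 2^24 floats per vector */
    const int num_blocks  = 512;
    const int num_threads = 512;
    const int values_per_block  = N / num_blocks;                 /* 32768 */
    const int values_per_thread = values_per_block / num_threads; /*    64 */

    float *d_vec0, *d_vec1;
    cudaMalloc((void**)&d_vec0, N * sizeof(float));
    cudaMalloc((void**)&d_vec1, N * sizeof(float));
    cudaMemcpy(d_vec0, h_vec0, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vec1, h_vec1, N * sizeof(float), cudaMemcpyHostToDevice);

    madd<<<num_blocks, num_threads>>>(values_per_block, values_per_thread,
                                      h_alpha, d_vec0, d_vec1);
    cudaThreadSynchronize();   /* make sure the kernel finished before timing */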
I tested this against a CPU implementation; the speedup was about 20x for 30 iterations on two vectors of 2^24 floats each, with #blocks = #threads = 512.
So my first (stupid) question is whether this is a reasonable speedup, or whether my implementation is bogus in the first place?
Since I use only global memory (I can't figure out how to make use of shared memory here), I assume that I'm already memory bound.
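My back-of-the-envelope reasoning for that (please correct me if I'm off): each pass reads a_i, reads b_i and writes a_i, so per iteration the kernel moves

    3 * 2^24 elements * 4 bytes = 192 MiB

of global memory traffic while performing only one multiply-add per element, so there is far too little arithmetic to hide the memory traffic.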
I read the CUDA programming guide, and there's a section saying that a thread can read 16 bytes (e.g. one float4) in a single instruction, so my idea for a speedup looks like this:
__global__ void
madd2(/* args */
      int elements_per_block,
      int elements_per_thread,
      float alpha,
      float4 *b_vector,
      /* input/output */
      float4 *a_vector)
{
    const int B = blockIdx.x;
    const int I = threadIdx.x;
    const int num_threads = blockDim.x;

    int base = B*elements_per_block + I;

    for (int slice = 0; slice < elements_per_thread; ++slice)
    {
        // one 16-byte load per vector instead of four 4-byte loads
        float4 source = b_vector[base];
        float4 dest   = a_vector[base];
        dest.x += alpha*source.x;
        dest.y += alpha*source.y;
        dest.z += alpha*source.z;
        dest.w += alpha*source.w;
        a_vector[base] = dest;
        base += num_threads;
    }
}
This should cut the number of load/store instructions by a factor of four (the total number of bytes moved stays the same, of course). While the first version is called as

madd<<<num_blocks, num_threads>>>(values_per_block, values_per_thread, h_alpha, d_vec0, d_vec1);

the second call goes like this:

madd2<<<num_blocks, num_threads>>>(values_per_block/4, values_per_thread/4, h_alpha, (float4*)d_vec0, (float4*)d_vec1);
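One thing I checked: the casts to float4* require 16-byte alignment, which cudaMalloc guarantees, and my sizes are powers of two, so the divisions by 4 are exact. A quick host-side sanity check (just a sketch):

    #include <assert.h>
    #include <stdint.h>

    assert(values_per_block  % 4 == 0);      /* counts must split evenly into float4s */
    assert(values_per_thread % 4 == 0);
    assert(((uintptr_t)d_vec0 & 15) == 0);   /* float4 loads need 16-byte alignment   */
    assert(((uintptr_t)d_vec1 & 15) == 0);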
My tests now show that this second kernel is about 33% slower than the first one. Can anybody give me a hint as to what's wrong here?
Thanks a lot!