How can I make use of shared memory when performing an element-wise vector update?

Hello, everyone.
I am solving a large system of linear equations with the conjugate-gradient method. In this algorithm, I want to update a vector x[n] using another vector p[n] and a scaling factor ‘alpha’, like

x[i] = x[i] + alpha*p[i], i = 1, ..., n;

As a beginner, I wrote the following C code, which works on the CPU:

for (i = 0; i < n; ++i)
    x[i] += alpha * p[i];

And here is my CUDA code:
__global__ void Update_vector(float* x, float* p, float alpha, int n)
{
    /* X(k+1) = X(k) + alpha*P(K) */
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < n; i += numThreads)
    {
        x[i] += alpha * p[i];
    }
}

But I know that updating a vector has low arithmetic intensity, and that accessing global memory has a long latency of about 500 cycles, so this code performs poorly. How can I speed up my CUDA program by using shared memory?
I really hope my CUDA program can get a huge speedup, like the “reduction” demo in the NVIDIA_CUDA_SDK projects.
Thanks.

Shared memory isn’t going to help, since there’s no element reuse.
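
To put a rough number on that (assuming 4-byte floats, and counting the multiply and the add as two operations): each element is read from x, read from p, and written back to x exactly once, so

```latex
\text{arithmetic intensity} \approx \frac{2\ \text{FLOP}}{3 \times 4\ \text{bytes}} \approx 0.17\ \text{FLOP/byte}
```

Since no value is loaded twice, staging it in shared memory would only add an extra copy; the kernel stays limited by global-memory bandwidth.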

You could apply your scale together with some other calculations. There is no need to apply scaling as a separate step via a separate kernel execution.
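
As a rough sketch of what that could look like: in a textbook conjugate-gradient iteration, the residual update r = r - alpha*A*p uses the same alpha, so both updates could go into one kernel. The names `r`, `Ap`, `cg_fused_update`, and `fused_update_matches_cpu` are assumptions for illustration and do not come from the original post; the test values are arbitrary.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical fused kernel: both CG vector updates in a single launch.
// Ap is assumed to already hold the product A*p from an earlier step.
__global__ void cg_fused_update(float* x, float* r, const float* p,
                                const float* Ap, float alpha, int n)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < n; i += numThreads) {
        x[i] += alpha * p[i];   // x(k+1) = x(k) + alpha * p(k)
        r[i] -= alpha * Ap[i];  // r(k+1) = r(k) - alpha * A*p(k)
    }
}

// Hypothetical self-check: runs the kernel and compares against a CPU loop.
bool fused_update_matches_cpu(int n, float alpha)
{
    assert(n > 0);
    std::vector<float> x(n, 1.0f), r(n, 2.0f), p(n), Ap(n);
    for (int i = 0; i < n; ++i) { p[i] = 0.001f * i; Ap[i] = 0.002f * i; }

    const size_t bytes = n * sizeof(float);
    float *dx = nullptr, *dr = nullptr, *dp = nullptr, *dAp = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dr, bytes);
    cudaMalloc(&dp, bytes);
    cudaMalloc(&dAp, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dr, r.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dp, p.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dAp, Ap.data(), bytes, cudaMemcpyHostToDevice);

    // The grid-stride loop tolerates any grid size; 64 x 256 is arbitrary.
    cg_fused_update<<<64, 256>>>(dx, dr, dp, dAp, alpha, n);

    // The blocking copies also surface any error from the launch above.
    std::vector<float> hx(n), hr(n);
    cudaError_t err = cudaMemcpy(hx.data(), dx, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(hr.data(), dr, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dr); cudaFree(dp); cudaFree(dAp);
    if (err != cudaSuccess) return false;

    // Loose tolerance: the GPU may fuse multiply and add, the CPU may not.
    for (int i = 0; i < n; ++i) {
        const float ex = x[i] + alpha * p[i];
        const float er = r[i] - alpha * Ap[i];
        if (std::fabs(hx[i] - ex) > 1e-4f * std::fabs(ex) + 1e-5f) return false;
        if (std::fabs(hr[i] - er) > 1e-4f * std::fabs(er) + 1e-5f) return false;
    }
    return true;
}
```

Note that this particular fusion mainly saves one kernel launch, since the two updates touch different vectors. The bigger wins come when the fused steps reuse a value that has already been loaded.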

Thank you, sergeyn. I think I have to do that.

Actually, I do not get why you use a “for” loop. Is it because you want one thread to calculate several elements?

As it seems to me, the usual way to code such things is:

[codebox]
__global__ void Update_vector(float* x, float* p, float alpha, int n)
{
    /* X(k+1) = X(k) + alpha*P(K) */
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n)
        x[threadID] += alpha * p[threadID];
}
[/codebox]
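
With one thread per element, the launch has to supply at least n threads, so the block count is usually rounded up; the version with the loop works with any grid size. Here is a minimal host-side sketch, assuming a block size of 256. The helpers `launch_update_vector` and `update_vector_matches_cpu` are hypothetical names, not from the thread.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Kernel as posted above: one thread per element, guarded against overrun.
__global__ void Update_vector(float* x, float* p, float alpha, int n)
{
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n)
        x[threadID] += alpha * p[threadID];
}

// Hypothetical launcher: d_x and d_p are device pointers of length n.
cudaError_t launch_update_vector(float* d_x, float* d_p, float alpha, int n)
{
    if (n <= 0) return cudaSuccess;                  // nothing to do
    const int threads = 256;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // round up: blocks*threads >= n
    Update_vector<<<blocks, threads>>>(d_x, d_p, alpha, n);
    return cudaGetLastError();
}

// Hypothetical self-check against the CPU loop from the first post.
bool update_vector_matches_cpu(int n, float alpha)
{
    assert(n > 0);
    std::vector<float> x(n, 1.0f), p(n);
    for (int i = 0; i < n; ++i) p[i] = 0.001f * i;

    const size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_p = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_p, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_p, p.data(), bytes, cudaMemcpyHostToDevice);

    cudaError_t err = launch_update_vector(d_x, d_p, alpha, n);

    std::vector<float> result(n);
    if (err == cudaSuccess)
        err = cudaMemcpy(result.data(), d_x, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_p);
    if (err != cudaSuccess) return false;

    // Loose tolerance: the GPU may fuse multiply and add, the CPU may not.
    for (int i = 0; i < n; ++i) {
        const float expected = x[i] + alpha * p[i];
        if (std::fabs(result[i] - expected) > 1e-4f * std::fabs(expected) + 1e-5f)
            return false;
    }
    return true;
}
```

Choosing an n that is not a multiple of the block size (for example 1000) exercises the `threadID < n` guard in the last block.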