Hello, everyone.

I am solving a large system of linear equations with the conjugate-gradient method. In the algorithm, I need to update a vector x[n] by another vector p[n] with a scaling factor ‘alpha’, like

x[i] = x[i] + alpha*p[i], i=1, n;

As a beginner, here is my C code running on the CPU:

```c
for (i = 0; i < n; ++i)
    x[i] += alpha * p[i];
```

and here is my CUDA code:

```c
__global__ void Update_vector(float* x, float* p, float alpha, int n)
{
    /* X(k+1) = X(k) + alpha*P(k) */
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < n; i += numThreads)
    {
        x[i] += alpha * p[i];
    }
}
```
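I also noticed that this update is just the SAXPY operation (y = alpha*x + y), so maybe I could call cuBLAS instead of writing my own kernel. Here is a sketch of what I mean (I haven't benchmarked this; x_d and p_d are assumed to be device pointers I allocated earlier):

```c
#include <cublas_v2.h>

/* x[i] += alpha * p[i] on the device, via the library SAXPY.
   x_d and p_d are device pointers of length n. */
void update_vector_cublas(float* x_d, const float* p_d, float alpha, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    /* cublasSaxpy computes y = alpha*x + y; here "x" is p_d and "y" is x_d */
    cublasSaxpy(handle, n, &alpha, p_d, 1, x_d, 1);
    cublasDestroy(handle);
}
```

Is a hand-written kernel ever faster than the library call for this?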

But I know that updating a vector has low arithmetic intensity, and accessing global memory has a long latency of about 500 cycles. So this code has poor performance. How can I speed up my CUDA program by using shared memory? :unsure:
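The only other idea I had is to read four floats at a time with float4, so each thread moves more data per memory transaction. Here is a sketch (assuming n is a multiple of 4 and the pointers are 16-byte aligned), though I don't know whether this actually helps:

```c
/* Same grid-stride update, but vectorized: n4 = n / 4,
   and each thread updates four elements per iteration. */
__global__ void Update_vector_float4(float4* x, const float4* p, float alpha, int n4)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID   = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < n4; i += numThreads)
    {
        float4 xi = x[i];
        float4 pi = p[i];
        xi.x += alpha * pi.x;
        xi.y += alpha * pi.y;
        xi.z += alpha * pi.z;
        xi.w += alpha * pi.w;
        x[i] = xi;
    }
}
```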

I really hope my CUDA program gets a huge speed-up, like the “reduction” demo in the NVIDIA_CUDA_SDK projects.

Thanks.