Hello, everyone.

I am solving a large system of linear equations with the conjugate-gradient method. In the algorithm, I need to update a vector x[n] by another vector p[n] with a scaling factor ‘alpha’, like

x[i] = x[i] + alpha*p[i], i=1, n;

As a beginner, here is my C code running on the CPU:

```c
for (i = 0; i < n; ++i)
    x[i] += alpha * p[i];
```

and here is my CUDA code:

```c
__global__ void Update_vector(float* x, float* p, float alpha, int n)
{
    /* X(k+1) = X(k) + alpha*P(k) */
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < n; i += numThreads)
    {
        x[i] += alpha * p[i];
    }
}
```
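I also noticed that this update is just the SAXPY operation (y = alpha*x + y), so maybe I could call cuBLAS instead of writing my own kernel. Here is a sketch of what I mean (I haven't benchmarked this; x_d and p_d are assumed to be device pointers I allocated earlier):

```c
#include <cublas_v2.h>

/* x[i] += alpha * p[i] on the device, via the library SAXPY.
   x_d and p_d are device pointers of length n. */
void update_vector_cublas(float* x_d, const float* p_d, float alpha, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    /* cublasSaxpy computes y = alpha*x + y; here "x" is p_d and "y" is x_d */
    cublasSaxpy(handle, n, &alpha, p_d, 1, x_d, 1);
    cublasDestroy(handle);
}
```

Is a hand-written kernel ever faster than the library call for this?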

But I know that updating a vector has low arithmetic intensity, and accessing global memory has a long latency of about 500 cycles. So this code has poor performance. How can I speed up my CUDA program by using shared memory? :unsure:
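The only other idea I had is to read four floats at a time with float4, so each thread moves more data per memory transaction. Here is a sketch (assuming n is a multiple of 4 and the pointers are 16-byte aligned), though I don't know whether this actually helps:

```c
/* Same grid-stride update, but vectorized: n4 = n / 4,
   and each thread updates four elements per iteration. */
__global__ void Update_vector_float4(float4* x, const float4* p, float alpha, int n4)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID   = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < n4; i += numThreads)
    {
        float4 xi = x[i];
        float4 pi = p[i];
        xi.x += alpha * pi.x;
        xi.y += alpha * pi.y;
        xi.z += alpha * pi.z;
        xi.w += alpha * pi.w;
        x[i] = xi;
    }
}
```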

I really hope my CUDA program gets a huge speed-up, like the “reduction” demo in the NVIDIA_CUDA_SDK projects.

Thanks.