an inner product followed by a saxpy

Sorry, I couldn’t think of a better title, but here is my problem. I’m trying to implement conjugate gradients for solving Ax = b. In each iteration I first need to compute

alpha = rho/dot(a,p)

and then r = r - alpha*a
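
For context, a minimal sketch of the second step, assuming alpha has already been reduced into a single float in device memory (the pointer name d_alpha and the kernel name are made up for illustration):

```cuda
// Hypothetical sketch: r <- r - alpha * a, with alpha kept in device
// memory so no host readback is needed. d_alpha points to the single
// float produced by the dot-product reduction.
__global__ void saxpy_dev_alpha(float *r, const float *a,
                                const float *d_alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] = r[i] - (*d_alpha) * a[i];  // every thread loads *d_alpha
}
```

This is the access pattern the question below is about: every thread reads the same scalar from global memory.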

Is there any way I can get alpha into constant memory, or something like that? Right now every thread in the saxpy computation is trying to access alpha, and hence they get blocked.

I also tried making alpha a 1x1 texture in the hope of speeding things up through caching, but then every time alpha changes (with each conjugate gradient iteration) I have to rebind my texture.

Please let me know what the best way of doing this would be.

Thanks in advance.

If alpha is just a number, can you pass it into the kernel as a parameter in the function call?

Yes, but to do this I think I have to read alpha back into host memory and then pass it. And the readback is going to take some time.
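
A sketch of that readback path, with assumed names (d_alpha, d_r, d_a, grid, block) standing in for whatever the real code uses:

```cuda
// Kernel takes alpha by value, so the host must fetch it first.
__global__ void saxpy(float *r, const float *a, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] -= alpha * a[i];
}

// Host side: one small device-to-host copy per iteration, which is
// the synchronizing readback the poster wants to avoid.
float h_alpha;
cudaMemcpy(&h_alpha, d_alpha, sizeof(float), cudaMemcpyDeviceToHost);
saxpy<<<grid, block>>>(d_r, d_a, h_alpha, n);
```

The copy is only four bytes, but it forces the host to wait for the dot-product kernel to finish before it can launch the saxpy.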

Is alpha an array or a scalar?

If it’s a scalar, can you just have thread 0 of each block load alpha into shared memory, then __syncthreads(), and then all threads can read it without conflicts (thanks to the shared memory broadcast mechanism)?
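
A sketch of that suggestion, again with a hypothetical device pointer d_alpha holding the scalar:

```cuda
// Thread 0 of each block loads alpha once from global memory; after
// __syncthreads() every thread reads the same shared-memory word,
// which shared memory serves as a broadcast (no bank conflicts).
__global__ void saxpy_shared_alpha(float *r, const float *a,
                                   const float *d_alpha, int n)
{
    __shared__ float s_alpha;
    if (threadIdx.x == 0)
        s_alpha = *d_alpha;        // one global load per block
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] -= s_alpha * a[i];    // broadcast read from shared memory
}
```

Alpha never leaves the device, so no host readback or texture rebinding is needed between iterations.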



There is a standard function: cudaMemset(void* devPtr, int value, size_t count);

It fills the first count bytes of the memory pointed to by devPtr with the constant byte value value.
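
One caveat worth noting: because cudaMemset writes a repeated byte rather than a typed value, for float data it is really only useful for clearing to zero, not for writing an arbitrary alpha. A small host-side sketch:

```cuda
// cudaMemset fills bytes, not floats: all-zero bytes happen to encode
// 0.0f, so zeroing works, but there is no byte pattern that repeats
// into an arbitrary value like 0.37f.
float *d_r;
cudaMalloc(&d_r, n * sizeof(float));
cudaMemset(d_r, 0, n * sizeof(float));  // every d_r[i] is now 0.0f
```

So it can initialize the residual buffer, but it cannot place the per-iteration alpha into device memory.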