Sorry, I couldn't think of a better title, but here is my problem. I'm trying to implement conjugate gradients for solving Ax = b. On each iteration I first need to compute
alpha = rho/dot(a,p)
and then r = r - alpha*a
Is there any way I can get alpha into constant memory or something like that? Right now every thread in the saxpy computation is trying to access alpha, and so they end up blocking on it.
I also tried making alpha a 1x1 texture, hoping the texture cache would speed things up, but then every time alpha changes (on each conjugate gradient iteration) I have to rebind the texture.
Please let me know what the best way of doing this would be.
If it's a scalar, can you just have thread 0 of each block load alpha into shared memory, call __syncthreads(), and then let all threads read it without conflicts? When every thread in a warp reads the same shared-memory address, the value is broadcast, so there are no bank conflicts.
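A minimal sketch of that suggestion, assuming alpha lives in a one-element device array (the kernel name and signature here are illustrative, not from the original post):

```cuda
// r = r - alpha * a, with alpha read from device memory once per block.
// Thread 0 loads the scalar into shared memory; after __syncthreads(),
// every thread reads it via the shared-memory broadcast path.
__global__ void saxpy_shared_alpha(float *r, const float *a,
                                   const float *alpha_dev, int n)
{
    __shared__ float alpha;          // one copy per block
    if (threadIdx.x == 0)
        alpha = *alpha_dev;          // single global-memory read per block
    __syncthreads();                 // make alpha visible to all threads

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] -= alpha * a[i];
}
```

Because alpha stays in device memory between iterations, there is also no texture to rebind and no host round trip when alpha changes.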