I am trying to implement an iterative linear solver, the Conjugate Gradient solver, in CUDA. It solves equations of the form
Ax = b,
where A is a sparse symmetric positive definite matrix,
x is the unknown vector, with an initial guess of 0, and
b is the vector on the right-hand side of the equation.
My code involves several operations, such as sparse matrix-vector multiplication and vector-vector operations.
My code works fine for matrix sizes up to 31 x 31, but not for anything larger. It may be because of the number of threads allocated to the kernel function. I am launching the kernel as mul<<<1,nrows>>>()
Here mul is the kernel that performs the sparse matrix-vector multiplication and nrows is the number of rows in the sparse matrix A.
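
For comparison, a launch that does not depend on nrows fitting into a single block is usually written with several blocks and a bounds check inside the kernel. The following is only a minimal sketch under assumptions, not the mul kernel from this post: it assumes the matrix is stored in CSR format with one thread per row, and the names spmv_csr, spmv_host, row_ptr, col_idx and vals are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: y = A * x for a CSR matrix, one thread per row.
__global__ void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
                         const float *vals, const float *x, float *y)
{
    // Global row index built from the block index and the thread index.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;  // the last block may have spare threads

    float sum = 0.0f;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += vals[k] * x[col_idx[k]];
    y[row] = sum;
}

// Hypothetical host helper: copies the data to the device, launches the
// kernel over enough blocks to cover all rows, and returns y on the host.
std::vector<float> spmv_host(const std::vector<int> &row_ptr,
                             const std::vector<int> &col_idx,
                             const std::vector<float> &vals,
                             const std::vector<float> &x)
{
    const int nrows = static_cast<int>(row_ptr.size()) - 1;
    const size_t nnz = vals.size();

    int *d_row_ptr, *d_col_idx;
    float *d_vals, *d_x, *d_y;
    cudaMalloc(&d_row_ptr, row_ptr.size() * sizeof(int));
    cudaMalloc(&d_col_idx, nnz * sizeof(int));
    cudaMalloc(&d_vals, nnz * sizeof(float));
    cudaMalloc(&d_x, x.size() * sizeof(float));
    cudaMalloc(&d_y, nrows * sizeof(float));

    cudaMemcpy(d_row_ptr, row_ptr.data(), row_ptr.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_idx, col_idx.data(), nnz * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, vals.data(), nnz * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x.data(), x.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    // Round up so that blocks * threads >= nrows for any nrows.
    const int threads = 128;
    const int blocks = (nrows + threads - 1) / threads;
    spmv_csr<<<blocks, threads>>>(nrows, d_row_ptr, d_col_idx, d_vals, d_x, d_y);

    // This blocking copy also waits for the kernel to finish.
    std::vector<float> y(nrows);
    cudaMemcpy(y.data(), d_y, nrows * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_row_ptr);
    cudaFree(d_col_idx);
    cudaFree(d_vals);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

With 128 threads per block, nrows = 100 gives 1 block and nrows = 1000 gives 8 blocks; the bounds check makes the unused threads in the last block return without touching memory.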
Is this problem related to the warp size of 32 threads?
If anyone knows, please advise.