Problem with operations on matrices larger than 31 x 31

Hello everyone…!!
I am trying to implement an iterative linear solver, the “Conjugate Gradient Solver”, in CUDA. It solves equations of the form Ax = b,
where A is a sparse symmetric positive definite matrix,
x is the unknown vector, with an initial guess of 0, and
b is the vector on the right-hand side of the equation.

My code includes many operations, such as sparse matrix-vector multiplication and vector-vector operations.

My code works fine for matrix sizes up to 31 x 31, but not for anything larger. This may be because of the number of threads allocated to a kernel function. I am allocating threads as

Here mul is the function that performs the sparse matrix-vector multiplication, and nrows is the number of rows in the sparse matrix A.

Is this problem related to the warp size of 32 threads?

If anyone knows, please advise.

Thank you…!!

If your problem were caused by warp size alone, it should begin to fail at 33x33 matrices, not already at 32x32.
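
Since the actual launch line is not shown in the question, the following is only a sketch of a common pattern and not a diagnosis of the original code. It assumes a CSR storage format and one thread per row. The names spmv_csr, blocks_for and count_bad_rows are hypothetical and are not the poster's mul function. The block count is rounded up so that every row gets a thread, and the kernel checks its row index against nrows so the spare threads in the last block do nothing. Error checking is left out for brevity.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical CSR sparse matrix-vector product y = A * x, one thread per row.
__global__ void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
                         const float *val, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;  // the last block may contain spare threads
    float sum = 0.0f;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += val[k] * x[col_idx[k]];
    y[row] = sum;
}

// Ceiling division: smallest block count with blocks * threads >= nrows.
int blocks_for(int nrows, int threads)
{
    return (nrows + threads - 1) / threads;
}

// Multiplies the n x n diagonal matrix 2*I by x = (1, 2, ..., n) on the GPU
// and returns the number of rows whose result differs from 2 * x[i].
// CUDA error codes are not checked here (assumption: a working device).
int count_bad_rows(int n)
{
    assert(n > 0);
    std::vector<int> row_ptr(n + 1), col_idx(n);
    std::vector<float> val(n, 2.0f), x(n), y(n, -1.0f);
    for (int i = 0; i < n; ++i) {
        row_ptr[i] = i;       // exactly one nonzero per row
        col_idx[i] = i;       // located on the diagonal
        x[i] = i + 1.0f;
    }
    row_ptr[n] = n;

    int *d_row_ptr = nullptr, *d_col_idx = nullptr;
    float *d_val = nullptr, *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_row_ptr, (n + 1) * sizeof(int));
    cudaMalloc(&d_col_idx, n * sizeof(int));
    cudaMalloc(&d_val, n * sizeof(float));
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_row_ptr, row_ptr.data(), (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_idx, col_idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, val.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 32;  // deliberately small so that n = 33 needs two blocks
    spmv_csr<<<blocks_for(n, threads), threads>>>(n, d_row_ptr, d_col_idx, d_val, d_x, d_y);
    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_row_ptr); cudaFree(d_col_idx);
    cudaFree(d_val); cudaFree(d_x); cudaFree(d_y);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (y[i] != 2.0f * x[i]) ++bad;
    return bad;
}

int main()
{
    const int sizes[] = {31, 32, 33, 1000};
    for (int n : sizes)
        printf("n = %4d  blocks = %2d  mismatched rows = %d\n",
               n, blocks_for(n, 32), count_bad_rows(n));
    return 0;
}
```

Running the same test on your own kernel for sizes 31, 32 and 33 would show exactly where it starts to fail. A failure that already appears at 32 usually points to an indexing or launch-size error rather than to the warp size itself.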