Banded sparse matrix linear eqution solve with CUDA

My problem: I have 5-diagonal single-precision matrix with dimensions about 5000x5000 (so there are ~25000 non-zero elements there) and have to solve linear equation system. MATLAB made it for ~1 ms when using "" operator. I tried realize it with CUDA and encountered following problems.

At first, CUBLAS can’t solve non-triangle linear equations and doesn’t have any factorization functions (except getrfBatched(), which is limited by 32x32 dimensions). OK, I used free version of CULA library and got too slow computation time, because this version doesn’t have banded solver.

At second, I tried to use CUSPARSE library. I tried checking conjugate gradient method. But it’s proved to be slow too, because it has too many (hundreds) iterations (but everyone is very fast, less than 1 ms).

At third, because CUSPARSE library has got incomplete Cholesky and LU factorization functions, I tried check preconditioned conjugate gradient method. But it was too slow too, because cusparseScsrsv_analysis()/cusparseScsrsv_solve() functions need 25 and 8 ms respectively.

Now I don’t know what to do. It seems I need fast banded matrix solve realization, but I can’t find corresponding library. May be I don’t have to use GPU for this problem and there are some CPU-optimized libraries for my problem?