Use of CUSPARSE for AX=B

I have a inverse multiplication solver from Matlab that takes around 6ms for solving the system of linear equations Ax=B, where A is 780X780. I have implemented a cublas based solution and it takes around 300ms. Is there any way by using CUBLAS/CUSPARSE, I can get less than the CPU function.

I read a lot of papers but the performance comparison for Ax=b on GPUs is dis-appointing. Does somebody have an idea what is the best way of solving the equation on GPUs to get the maximum performance. Or, is there no way I can beat the optimized Matlab solution on CPU.

Thanks in advance for the replies.

I do not know how you are implementing this, but for a 784x784 lower triangular matrix I get less then 1 ms(it says 0 ms) to solve for AX=B using cublas Strsm(), which is faster than MATLAB.
It returns the correct answer.

Keep in mind that MATLAB is 10-100 times faster than a naive algorithm written in common programming languages. MATLAB is highly optimized and has had teams of Phds working on making it as fast a possible for over 13 years. GPU computing has not even been around half that time, but still whoops MATLAB for larger data sets.

Also are the matrices dense or sparse? GPUs will generally perform better on large dense matrices.

MATLAB also may look for sparsity patterns in dense matrices so it can use the faster sparse solvers. You need to test on full dense matrices using cuBLAS and MATLAB correctly. Some characteristics of the Matrix used in MATLAB may be cached if you run more than once, and that may give the MATLAB routine an advantage.

You are comparing apples to oranges and not correctly setting up the test.
GPUs scale very well, but for such a small matrix the difference will be less significant than a large (n>5000) matrix.

What GPU is being used and what CPU?
Is MATLAB taking advantage of the multithreaded capabilities of the CPU?
What does the command line for nvcc look like?
Are you running in debug mode with -G or -g flags?
Are you using the correct compute capability for your GPU? (1.0, 2.0, 3.0, 3.5 etc)
Are you counting memory copy time?
Are you awaking the GPU from idle state before the test or are you including this time in the test? (Newer GPUs can idle to save power and need a small amount of time to get going)

Even for a dense 4096x4096 matrix , cublas Strsm() takes less than 1 ms(0 ms) on my machine, and I know MATLAB will not be even close to that time for a 4096x4096 dense matrix.

This also returned the correct answer.

So If cuBLAS can solve on Kepler AX=B in less than 1 ms, you really cannot get much better than that.

The Matrix is a banded sparse matrix with a bandwidth of 3.
Dimensions are 780X780.

The graphics card is GTX680 with compute capability 3.0.
There are no -g flags.
Yes, I am taking the memory copy time as well.
The gpu is awake before the cublas call.

When you say cublas strsm take less than 1ms for a dense 4096X4096 matrix, will it be simailar for a 780X780 sparse matrix?

@CudaudaC: I think you misunderstood my problem.
The matrix is a square matrix of 780X780 and I have to solve Ax=B.
So, I want to break A into LU and then do the forward and backward substitution.

You said you wanted to solve for AX=B, and were complaining specifically about cublas

cuBLAS has the ability for DENSE matrices to solve such linear systems, and it is very fast;

cuSPARSE is for sparse matrices, and also has a similar solver;

Also cuSPARSE has an incomplete LU which may be of help;

cuSPARSE also has a bunch of really fast format conversion functions which can convert most of the sparse formats into dense and back in a few ms.

Also in the CUDA SDK samples there is a dense LU decomp sample;

So that gives you all you need to solve your problem. If implemented correctly you will get faster speeds than MATLAB on a GTX 680.

Read over the documentation for cuBLAS and cuSPARSE, go through the CUDA SDK samples in the ‘Advanced’ folder, and make sure you specify the correct project properties.

CUBLAS does not provide any factorization routines.

Yep, but there is that LU decomp sample in the SDK which works.

I use my own Cholesky factor kernel which is faster, but one can use cuBLAS in the process of factoring.

@CudacduC: Are you referring to the LU decomposition that works only on devices of compute capability 3.5

Banded matrix you say? Which diagonals are nonzero? If theyre the -1,0 and 1 diagonals, then CUSPARSE has a gtsv (tridiagonal solve) function will be supremely fast (0.1ms at most)

Yes…all the diagonals are non-zero.
gtsvbatched:- solves my purpose.
Thanks a lot!!

Just curious, what’s the speedup?

For batch size of 1440, the speed-up is 10X on GTX680.
Note that the speed-up is compared to the Matlab based LU decomposition.