Use of CUSPARSE for AX=B

Karan_Sharma · July 17, 2013, 10:36am

I have a inverse multiplication solver from Matlab that takes around 6ms for solving the system of linear equations Ax=B, where A is 780X780. I have implemented a cublas based solution and it takes around 300ms. Is there any way by using CUBLAS/CUSPARSE, I can get less than the CPU function.

I read a lot of papers but the performance comparison for Ax=b on GPUs is dis-appointing. Does somebody have an idea what is the best way of solving the equation on GPUs to get the maximum performance. Or, is there no way I can beat the optimized Matlab solution on CPU.

Thanks in advance for the replies.

CudaaduC · July 17, 2013, 7:27pm

I do not know how you are implementing this, but for a 784x784 lower triangular matrix I get less then 1 ms(it says 0 ms) to solve for AX=B using cublas Strsm(), which is faster than MATLAB.
It returns the correct answer.

Keep in mind that MATLAB is 10-100 times faster than a naive algorithm written in common programming languages. MATLAB is highly optimized and has had teams of Phds working on making it as fast a possible for over 13 years. GPU computing has not even been around half that time, but still whoops MATLAB for larger data sets.

Also are the matrices dense or sparse? GPUs will generally perform better on large dense matrices.

MATLAB also may look for sparsity patterns in dense matrices so it can use the faster sparse solvers. You need to test on full dense matrices using cuBLAS and MATLAB correctly. Some characteristics of the Matrix used in MATLAB may be cached if you run more than once, and that may give the MATLAB routine an advantage.

You are comparing apples to oranges and not correctly setting up the test.
GPUs scale very well, but for such a small matrix the difference will be less significant than a large (n>5000) matrix.

What GPU is being used and what CPU?
Is MATLAB taking advantage of the multithreaded capabilities of the CPU?
What does the command line for nvcc look like?
Are you running in debug mode with -G or -g flags?
Are you using the correct compute capability for your GPU? (1.0, 2.0, 3.0, 3.5 etc)
Are you counting memory copy time?
Are you awaking the GPU from idle state before the test or are you including this time in the test? (Newer GPUs can idle to save power and need a small amount of time to get going)

Even for a dense 4096x4096 matrix , cublas Strsm() takes less than 1 ms(0 ms) on my machine, and I know MATLAB will not be even close to that time for a 4096x4096 dense matrix.

This also returned the correct answer.

So If cuBLAS can solve on Kepler AX=B in less than 1 ms, you really cannot get much better than that.

Karan_Sharma · July 18, 2013, 5:43am

@CudaaduC:
The Matrix is a banded sparse matrix with a bandwidth of 3.
Dimensions are 780X780.

The graphics card is GTX680 with compute capability 3.0.
There are no -g flags.
Yes, I am taking the memory copy time as well.
The gpu is awake before the cublas call.

When you say cublas strsm take less than 1ms for a dense 4096X4096 matrix, will it be simailar for a 780X780 sparse matrix?

Karan_Sharma · July 18, 2013, 11:26am

@CudaudaC: I think you misunderstood my problem.
The matrix is a square matrix of 780X780 and I have to solve Ax=B.
So, I want to break A into LU and then do the forward and backward substitution.

CudaaduC · July 18, 2013, 3:28pm

You said you wanted to solve for AX=B, and were complaining specifically about cublas

cuBLAS has the ability for DENSE matrices to solve such linear systems, and it is very fast;

http://docs.nvidia.com/cuda/cublas/index.html#topic_8_7

cuSPARSE is for sparse matrices, and also has a similar solver;

http://docs.nvidia.com/cuda/cusparse/index.html#topic_8_3

Also cuSPARSE has an incomplete LU which may be of help;

http://docs.nvidia.com/cuda/cusparse/index.html#topic_10_2

cuSPARSE also has a bunch of really fast format conversion functions which can convert most of the sparse formats into dense and back in a few ms.

http://docs.nvidia.com/cuda/cusparse/index.html#topic_11_3

Also in the CUDA SDK samples there is a dense LU decomp sample;

http://docs.nvidia.com/cuda/cuda-samples/index.html#advanced

So that gives you all you need to solve your problem. If implemented correctly you will get faster speeds than MATLAB on a GTX 680.

Read over the documentation for cuBLAS and cuSPARSE, go through the CUDA SDK samples in the ‘Advanced’ folder, and make sure you specify the correct project properties.

mfatica · July 18, 2013, 10:50pm

CUBLAS does not provide any factorization routines.

CudaaduC · July 18, 2013, 10:58pm

Yep, but there is that LU decomp sample in the SDK which works.

I use my own Cholesky factor kernel which is faster, but one can use cuBLAS in the process of factoring.

Karan_Sharma · July 19, 2013, 8:14am

@CudacduC: Are you referring to the LU decomposition that works only on devices of compute capability 3.5

sBc-Random · July 20, 2013, 4:04pm

Banded matrix you say? Which diagonals are nonzero? If theyre the -1,0 and 1 diagonals, then CUSPARSE has a gtsv (tridiagonal solve) function will be supremely fast (0.1ms at most)

Karan_Sharma · July 21, 2013, 1:48pm

Yes…all the diagonals are non-zero.
gtsvbatched:- solves my purpose.
Thanks a lot!!

sBc-Random · July 21, 2013, 2:57pm

Just curious, what’s the speedup?

Karan_Sharma · July 22, 2013, 2:25pm

For batch size of 1440, the speed-up is 10X on GTX680.
Note that the speed-up is compared to the Matlab based LU decomposition.

Topic		Replies	Views
cusparse Ax=B? GPU-Accelerated Libraries	4	1313	February 28, 2014
How to solve a system of linear equation(matrix of the order of 10000*100000) using CUBLAS GPU-Accelerated Libraries	3	1305	October 16, 2014
Banded sparse matrix linear eqution solve with CUDA GPU-Accelerated Libraries	0	2277	January 16, 2013
Feasibility of a Matlab project CUDA Setup and Installation	2	1247	August 7, 2014
Linear equations and sparse matrix CUBLAS, CUSPARSE, CULA CUDA Programming and Performance	2	2596	May 20, 2011
Is there library or code to solve Large Sparse Linear system or do Large sparse Matrix Multiplicatio CUDA Programming and Performance	4	3386	April 28, 2009
Solve sparse system of Ax = B with LU GPU-Accelerated Libraries	0	705	May 17, 2015
cuSPARSE for solving Ax=b on matrix ~ 230400x230400 GPU-Accelerated Libraries	3	3827	December 31, 2015
Solutions of linear equations CUDA Programming and Performance	7	5514	July 16, 2010
Solving an equation system with CUBLAS CUDA Programming and Performance	6	16205	February 8, 2009

Use of CUSPARSE for AX=B

Related topics