Why is CUDA slower than UMFPACK in solving the linear system Ax=b

Hi,

I have written code in C++ on a Linux system that solves the linear system Ax=b, where A is a sparse symmetric matrix, using the following two approaches:

  1. Using UMFPACK to sequentially factorize and perform backward/forward substitution.
  2. Using UMFPACK to sequentially factorize, followed by backward/forward substitution on the GPU using the CUDA library cuSPARSE (sketched below).
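
For reference, the core of the cuSPARSE step in approach 2 looks roughly like the sketch below, which uses the legacy csrsv interface available in CUDA 5.0. The names (solve_lower, d_val, d_rowPtr, d_colInd, d_b, d_y) are placeholders rather than names from my code, the analogous calls for the U factor are omitted, and the descriptor settings should be checked against the cuSPARSE documentation for the CUDA version in use.

#include <cusparse_v2.h>

// Sketch: solve L*y = b on the GPU, where L is the unit lower-triangular
// factor stored on the device in CSR format (d_val, d_rowPtr, d_colInd are
// placeholder device pointers; n and nnz describe the factor).
void solve_lower(cusparseHandle_t handle, int n, int nnz,
                 const double *d_val, const int *d_rowPtr, const int *d_colInd,
                 const double *d_b, double *d_y)
{
    const double alpha = 1.0;

    // Describe L as a lower-triangular matrix with a unit diagonal and
    // 0-based indices (verify these settings against the cuSPARSE manual
    // for your CUDA version).
    cusparseMatDescr_t descrL;
    cusparseCreateMatDescr(&descrL);
    cusparseSetMatType(descrL, CUSPARSE_MATRIX_TYPE_TRIANGULAR);
    cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);
    cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_UNIT);
    cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ZERO);

    // Analysis phase: builds the dependency information used by the solve.
    cusparseSolveAnalysisInfo_t infoL;
    cusparseCreateSolveAnalysisInfo(&infoL);
    cusparseDcsrsv_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            n, nnz, descrL, d_val, d_rowPtr, d_colInd, infoL);

    // Solve phase: y = inv(L) * b.
    cusparseDcsrsv_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                         n, &alpha, descrL, d_val, d_rowPtr, d_colInd, infoL,
                         d_b, d_y);

    cusparseDestroySolveAnalysisInfo(infoL);
    cusparseDestroyMatDescr(descrL);
}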

The system configuration I have is: CUDA version 5.0, UMFPACK version 5.6.2, Linux kernel version Debian 3.2.46-1.

In theory, the second approach should perform better than the first, with minimal or no difference in the results. However, I am observing the following problems:

  1. The backward/forward substitution using the UMFPACK function umfpack_di_solve is almost 2x faster than the CUDA variant.
  2. For some matrices, the error between the results obtained using UMFPACK and CUDA is quite large, with a maximum error of 3.2537, whereas for other matrices it is of the order of 1e-16.

Attached is my tar file with the following components:

  • A folder factorize_copy with the main file fc.cu that I am using to solve the linear system. It reads the sparse matrices from the files grid_*_CSC.m which are also present in the same directory. For convenience, the results with the three sparse matrices provided are also given in the text files.
  • A folder with all the dependencies for compiling and running UMFPACK (which we are also using for computations).

The link to the tar file is

If you wish to run the code, I have provided the Makefile I use on my system in the factorize_copy directory. You may need to recompile the UMFPACK libraries.

A sample output from our program for a 586x586 sparse matrix is shown below (note that the errors for this case are very high compared to the other sparse matrices we checked).

***** Reading the Grids

	Reading Grids Successful

***** Solving a sparse matrix of size: 586x586

***** Solving the grid on umfpack

***** Factorizing The Grid

-------------- CPU TIME for umfpack factorization is: 0.00109107

-------------- Wall-Clock TIME for umfpack factorization is: 0

	Factorizing the Grid successful

	Solving the grid on umfpack successful

-------------- CPU TIME for umfpack solve is: 6.281e-05

***** Allocating GPU memory and Copying data

---------------- CPU Time for Allocating GPU memory and Copying Data: 1.6

***** Performing b = P*b to account for the row ordering in A 

	Matrix-Vector (Pb) multiplication successful

***** Solving the system: LUx=b

	Analyzing Ly = b successful

	Solving Ly = b successful

	Analyzing Ux = y successful

	Solving Ux = y successful

***** Performing x = Q*x to account for the column ordering in A

	Matrix-Vector (Qx) multiplication successful

---------- GPU Time for the solve is: 5.68029 ms

##### Maximum error between UMFPACK and CUDA: 3.2537

##### Average error between UMFPACK and CUDA: 0.699926

***** Writing the results to the output files

	Result Written to the file 'vout_586.m' and the file 'vout_umfpack_586.m'

(Operation Successful!)

I would really appreciate it if anyone could point out what the possible error could be in this case. If there is a better method of solving sparse linear systems using CUDA that I am missing, please let me know.

EDIT: I figured out why it was giving large errors in some cases but not in others: I had a mistake in the number of threads per block when calling a kernel function in the code. However, I still have the issue of not getting a speed-up.
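
For anyone running into the same symptom, the essence of the fix is the usual pattern of deriving the grid size from the problem size and guarding against out-of-range threads; a minimal sketch with a hypothetical kernel (my_kernel and its arguments are placeholders, not names from the attached code):

// Hypothetical element-wise kernel; the guard i < n protects the threads
// of the (possibly only partially used) last block.
__global__ void my_kernel(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0 * in[i];
}

void launch(const double *d_in, double *d_out, int n)
{
    const int threadsPerBlock = 256;
    // Round up so that every one of the n elements is covered by a thread.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    my_kernel<<<blocks, threadsPerBlock>>>(d_in, d_out, n);
}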

A few observations:

(1) It is impossible to put the stated performance numbers into perspective without knowing what kind of hardware the CPU and GPU software were running on and in what kind of configuration. I do not use CUSPARSE, but I could imagine that a sparse system of dimension 586 is too small to achieve really good performance on the GPU.

(2) For most kinds of errors, comparing the error from two pieces of software, both presumably operating at the same precision, unfortunately tells us nothing about which software produces more accurate results. A reasonable way to find out would be a comparison with results from software that uses higher precision. I frequently use dense solvers using double-double arithmetic as a reference in my own work.

(3) It is not clear what kind of error is being reported here: forward error, backward error, absolute error, relative error, normwise error? Depending on what error is being reported, vastly different error magnitudes between different systems could be a function of the condition number of the matrix, for example.

Hi Njuffa,

Thanks for your observations. The configuration I used is (a) CPU - Intel Core i7-4770 quad-core processor, and (b) Graphics card - GeForce GTX Titan. Regarding your observations:

The result reported here is just a sample; the same trend is observed for larger sparse matrices as well. For example, for a system of dimension 113304, the CPU time for the UMFPACK backward/forward substitution is 0.0330232 secs, whereas for CUDA it is 0.145 secs. For even bigger grids, CUDA slows down further. By the way, is it reasonable to expect a speed-up from CUDA when the CPU-only version itself is so fast? I had no idea before writing the code that the backward/forward solve using UMFPACK is that fast.

Can you name any such ‘dense solvers using double-double arithmetic’ with which we can check the results? That would be quite helpful.

The errors are calculated based on the difference between the final results obtained using the two approaches, i.e. if x_umfpack and x_cuda are the results obtained when Ax = b is solved using approach 1 (UMFPACK) and approach 2 (CUDA), respectively, then

max_error = MAX( abs(x_umfpack - x_cuda) );
avg_error = MEAN( abs(x_umfpack - x_cuda) );
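
In C++ this comparison is just a single loop; a small sketch, with x_umfpack and x_cuda standing for the two solution vectors:

#include <cmath>
#include <vector>

// Maximum and average element-wise absolute difference between the two
// solution vectors (a sketch of the comparison described above).
void compare(const std::vector<double> &x_umfpack,
             const std::vector<double> &x_cuda,
             double &max_error, double &avg_error)
{
    max_error = 0.0;
    double sum = 0.0;
    for (std::size_t i = 0; i < x_umfpack.size(); ++i) {
        const double diff = std::fabs(x_umfpack[i] - x_cuda[i]);
        if (diff > max_error) max_error = diff;
        sum += diff;
    }
    avg_error = sum / x_umfpack.size();
}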

I have checked the condition numbers of the matrices, and they seem to be fine.

If I searched the internet correctly, your CPU would appear to be a 3.4 GHz quad-core Haswell with 8 MB of L3 cache, which means it can hold the entire data for your sample problem.

As I said, I do not use CUSPARSE, so I have no idea what performance you should expect here. My comment regarding the size of your system was based on my understanding that typical use cases for CUSPARSE start at around a dimension of 2000, and that a suitable strategy for systems of your size can be to simply use a dense solver on the expanded (dense) version of the matrix.

I do not have access to your code, so I cannot tell whether your code calls a single CUSPARSE function or a sequence of CUSPARSE functions. If it is a single CUSPARSE function that does not scale well compared to an equivalent function on the CPU, I would suggest filing a feature request for a more optimized version using the bug-reporting form linked from the registered developer website. If the code consists of a sequence of CUSPARSE operations, you may want to consider whether the chosen approach makes the best possible use of CUSPARSE (maybe one of the other CUSPARSE users among the customers frequenting these forums could look at your code and comment on possible improvements).

In my work, I use manually written reference solvers using double-double arithmetic, nothing standard. One standard technique for establishing the quality of a solver's solution is to compute the residual, which is something I do in my own work and which would appear to differ from your approach.
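
For what it is worth, such a residual check needs only one sparse matrix-vector product per candidate solution: compute r = b - A*x for x_umfpack and for x_cuda and compare the residual norms. A minimal sketch, assuming A is available on the host in CSR form (the names here are placeholders, not taken from the posted code):

#include <algorithm>
#include <cmath>
#include <vector>

// Infinity norm of the residual r = b - A*x, with A in CSR form on the host.
// A small residual for one solution and a large one for the other indicates
// which solver is off, without needing a higher-precision reference.
double residual_inf_norm(int n,
                         const std::vector<double> &val,
                         const std::vector<int> &rowPtr,
                         const std::vector<int> &colInd,
                         const std::vector<double> &b,
                         const std::vector<double> &x)
{
    double norm = 0.0;
    for (int i = 0; i < n; ++i) {
        double Ax_i = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            Ax_i += val[k] * x[colInd[k]];
        norm = std::max(norm, std::fabs(b[i] - Ax_i));
    }
    return norm;
}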

I checked with the CUSPARSE team and got the following feedback (paraphrased by me, I hope that I did not introduce any errors):

(1) Your code appears to use UMFPACK for factorization, then compares the performance of the triangular solve using either CUSPARSE or UMFPACK. As far as is known, UMFPACK uses internal data structures that are generated during the factorization stage to speed up its triangular solve, such as the tracking of dense portions. This information is not available to CUSPARSE.

(2) Care must be exercised when extracting data from UMFPACK into CUSPARSE format so the resulting data conforms to CUSPARSE requirements. In particular, diagonal elements must be present and indices must be sorted by column within each row.

(3) You must also account for any permutations, such as reordering to minimize fill-in and pivoting, that were applied during factorization, as CUSPARSE is not aware of these (see the sketch after this list).

(4) CUSPARSE’s triangular solve is targeted at sizes > 50K, ideally > 100K.

(5) If you have an s.p.d. matrix, you might want to give CHOLMOD a try; it is accelerated by offloading large dense calls to the GPU (http://www.cise.ufl.edu/research/sparse/cholmod/).
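
To illustrate item (3): with a factorization of the form P*A*Q = L*U (row scaling, which UMFPACK may also apply, is ignored here), solving A*x = b via the GPU triangular solves means computing x = Q*(inv(U)*(inv(L)*(P*b))), i.e. permuting the right-hand side before the solves and the solution after them. A host-side sketch under the assumed convention that P[k] is the row of A that becomes row k of P*A and Q[k] is the column of A that becomes column k of A*Q:

#include <vector>

// Apply the row permutation before the L/U solves: (P*b)[k] = b[P[k]].
std::vector<double> permute_rhs(const std::vector<int> &P,
                                const std::vector<double> &b)
{
    std::vector<double> pb(b.size());
    for (std::size_t k = 0; k < b.size(); ++k)
        pb[k] = b[P[k]];
    return pb;
}

// Undo the column permutation after the L/U solves: x[Q[k]] = z[k],
// where z is the result of the U solve.
std::vector<double> unpermute_solution(const std::vector<int> &Q,
                                       const std::vector<double> &z)
{
    std::vector<double> x(z.size());
    for (std::size_t k = 0; k < z.size(); ++k)
        x[Q[k]] = z[k];
    return x;
}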

Hello njuffa,

Thank you for your excellent feedback; I really appreciate your efforts. I will give CHOLMOD a try and see if it helps.

Regards.