Sparse Matrix-Vector Multiplication on CUDA

I’m pleased to announce the release of our tech report “Efficient Sparse Matrix-Vector Multiplication on CUDA”. It can be challenging to implement sparse matrix operations efficiently, so I hope this report offers some guidance to those working on iterative solvers and other areas where sparse matrix-vector products arise. Feel free to follow up with comments/questions about the report.

Update: A collection of test matrices is now available

Update: Supercomputing '09 SpMV paper is now available

Update: New and improved source code is now available (Requires CUDA 2.2 or newer)

Update: Cusp sparse matrix library has been released (Requires CUDA 3.0 or newer)

Abstract:

nvr_2008_004.pdf (3.44 MB)

Very interesting…

Since I am interested in it, can you give a rough estimate of when the code will be available (say a few days, a week, a month…)? Thanks!

This is very useful. Thanks a lot guys.

Is it possible to point out where I can get the data for the test matrices?

Also, when reporting the final results, did you use the texture cache to read x, or not?

I would estimate one or two weeks.

Some of the matrices are online here:

http://www.cis.ufl.edu/research/sparse/mat…oeing/pwtk.html (Wind Tunnel)

http://www.cis.ufl.edu/research/sparse/mat…n/rail4284.html (LP)

The full set of matrices is over 200MB in compressed format. I’ll find a way to put them online.

Yes, we do use the texture cache in the comparison to the multicore systems.

Thanks

If these matrices are already available on the UF Sparse Matrix page or on the Matrix Market page, you can just point them out.

All the matrices are available here:

http://bebop.cs.berkeley.edu/matrices/multicore-subset/

Those files use the Harwell-Boeing format [1]. You can use the BeBOP Sparse Matrix Converter [2] to convert them to another format. Our code (which I will post soon!) uses the MatrixMarket file format [3], so I’ll put MM versions of the files online when the code has been released.

[1] http://math.nist.gov/MatrixMarket/formats.html#hb

[2] http://bebop.cs.berkeley.edu/smc/

[3] http://math.nist.gov/MatrixMarket/formats.html#MMformat
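For reference, the MatrixMarket coordinate format linked above is plain text: a banner line, optional `%` comment lines, a size line (rows, columns, nonzeros), then one (row, column, value) triple per line with 1-based indices. A minimal Python reader sketch (the parser and sample data here are illustrative, not part of the released code):

```python
from io import StringIO

def read_mm_coordinate(f):
    """Parse a MatrixMarket coordinate-format stream into (nrows, ncols, entries)."""
    header = f.readline()
    assert header.startswith("%%MatrixMarket"), "not a MatrixMarket file"
    line = f.readline()
    while line.startswith("%"):          # skip comment lines
        line = f.readline()
    nrows, ncols, nnz = map(int, line.split())
    entries = []
    for _ in range(nnz):
        i, j, v = f.readline().split()
        entries.append((int(i) - 1, int(j) - 1, float(v)))  # 1-based -> 0-based
    return nrows, ncols, entries

sample = """%%MatrixMarket matrix coordinate real general
3 3 4
1 1 2.0
2 2 3.0
3 1 1.0
3 3 4.0
"""
print(read_mm_coordinate(StringIO(sample)))
```

In practice a library reader (e.g. `scipy.io.mmread`) also handles the `pattern` and `symmetric` qualifiers, which this sketch ignores.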

nbell, terrific work! Will this be made available in future versions of the CUDA SDK or in cudpp?

How does your approach compare to block-CSR as suggested by Buatois et al., http://alice.loria.fr/publications/papers/…s_et_al_CNC.pdf ? I haven’t found any comparison in your otherwise brilliant paper.

dom

@nbell, thanks for the links.

Hi Dominik,

A (possibly different) version of this code should appear in the CUDA 2.2 SDK. However, we’ll release the complete source code in the next week or so here.

Unfortunately, I don’t have any direct comparisons to the CNC work. Blocking is often a very useful optimization as it improves locality and reuse of the “source” vector (i.e. x in y = A*x). This is nearly always a win when the matrix has a natural blocksize, such as a vector-valued problem with regular KxK subblocks. The CNC paper uses blocking even though the underlying problem is scalar-valued. In their application (mesh-based matrices) this proves to be advantageous because the subblocks have a reasonable number of nonzeros due to the ordering of rows and columns. However, there are many problems that do not benefit from blocking, so formats like block-CSR aren’t always beneficial.

We’d definitely like to explore this area though. I think it’s likely that different strategies are needed to handle different block sizes optimally. For example, the best method for 2x2 blocks will probably be different than the best method for 6x6 blocks, since you run into a different set of constraints.
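To make the block-size discussion concrete, here is a minimal CPU sketch of SpMV for a 2x2 block-CSR layout (the data layout and function name are illustrative, not from the released code). Note how both entries of x for a block column are loaded once and reused across the block, which is the locality benefit described above:

```python
def bcsr_spmv_2x2(block_ptr, block_col, block_val, x):
    """y = A*x for a matrix stored in 2x2 block-CSR.
    block_val[b] is a flattened 2x2 block in row-major order."""
    nbrows = len(block_ptr) - 1
    y = [0.0] * (2 * nbrows)
    for bi in range(nbrows):
        y0 = y1 = 0.0
        for b in range(block_ptr[bi], block_ptr[bi + 1]):
            bj = block_col[b]
            v = block_val[b]
            x0, x1 = x[2 * bj], x[2 * bj + 1]   # fetched once, reused by both rows
            y0 += v[0] * x0 + v[1] * x1
            y1 += v[2] * x0 + v[3] * x1
        y[2 * bi], y[2 * bi + 1] = y0, y1
    return y

# 4x4 matrix with two dense 2x2 blocks on the (block) diagonal
y = bcsr_spmv_2x2([0, 1, 2], [0, 1],
                  [[1, 2, 3, 4], [5, 6, 7, 8]],
                  [1, 1, 1, 1])
```

The cost, as noted above, is that every stored block holds r*c values even when some of them are zero, which is why blocking only pays off when the pattern fills the blocks reasonably well.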

Hi nbell,

I agree, blocking is not always beneficial (or rather, the optimal blocking technique is application-dependent). I’ve toyed around a bit with renumbering strategies like Cuthill-McKee and the like, and for our FEM-CFD matrices, this helps create a reasonable number of nonzero values per average block.

Anyway, having code available for general SpMV is definitely one of the missing links between CUDA and (some) real-world applications, so congrats again!

dom

Yeah, the links were useful to me too. Waiting for the CUDA version of it.

I’ve attached a zip file containing the source code to the original post. The included README.txt should provide enough info to get you started. Please let me know if you have any trouble with the code!

Once I get the complete set of matrices online I’ll add another link.

Dear nbell,
Is your method also suitable for calculating sparse matrix-matrix multiplication?

Hi,

Thank you very much, Nathan, for making your code available.

We are excited to see this technology coming out.

About blocking, there is a possible strategy (that we have not yet included in the Concurrent Number Cruncher): one can construct the sparsity pattern for different block sizes and evaluate the best block size to use based on the filling ratios. Since constructing the sparsity pattern costs less than constructing the whole sparse matrix, several block sizes can reasonably be tried. I read a paper a while ago that used this strategy (I do not remember where; I will post the reference if I find it).

Regards,
– Bruno Levy
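Bruno's suggestion can be sketched in a few lines: given only the nonzero pattern, count the distinct r x c blocks that would be stored and compare against the true nonzero count. The function and data below are a hypothetical illustration of the strategy, not code from the CNC:

```python
def fill_ratio(coords, r, c):
    """Stored entries per true nonzero if the pattern `coords`
    (a list of (i, j) pairs) is tiled with r x c blocks.
    Always >= 1.0; values near 1.0 mean the blocks are well filled."""
    blocks = {(i // r, j // c) for i, j in coords}   # blocks that contain a nonzero
    stored = len(blocks) * r * c                     # every stored block is dense
    return stored / len(coords)

# Two dense 2x2 clusters plus two isolated diagonal entries
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (3, 3)]
for (r, c) in [(1, 1), (2, 2), (4, 4)]:
    print((r, c), fill_ratio(coords, r, c))
```

One would evaluate this for each candidate block size and pick the smallest ratio, possibly weighted by the per-block-size SpMV throughput measured offline.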

Hi Kagami,

Are you referring to the case A*X where A is a sparse matrix but X is a dense matrix? If so, you can use the code above and simply compute A*x several times, where x is a single column of the matrix X. Alternatively, you could write a kernel that processes multiple columns at the same time, as we suggest in the “Future Work” section of the paper. This optimization can improve performance considerably.

If you’re referring to the case A*X where both A and X are sparse matrices, then that’s a harder problem that this code does not address. FWIW, I have implemented this operation in another code (an algebraic multigrid (AMG) solver), so if you’re interested I can provide some assistance.

Are you referring to the “Sparsity” package [1]? They discuss a simple model based on filling ratios that guides the selection of block size. Since the Sparsity paper, there has been some work on “variable blocking” where, I believe, one first extracts all the larger blocks (e.g. 4x4) and then extracts smaller blocks (4x1, 2x2, etc.). I haven’t experimented with these methods myself, but it seems like a good idea.

[1] http://hpc.sagepub.com/cgi/reprint/18/1/135
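The column-at-a-time approach described above for A*X with dense X can be sketched as a plain-Python reference (not the CUDA kernel; the names and CSR layout here are illustrative):

```python
def csr_spmv(ptr, ind, val, x):
    """Reference CSR sparse matrix-vector product y = A*x."""
    return [sum(val[k] * x[ind[k]] for k in range(ptr[i], ptr[i + 1]))
            for i in range(len(ptr) - 1)]

def csr_spmm_dense(ptr, ind, val, X):
    """A*X for a dense X given as a list of columns: one SpMV per column."""
    return [csr_spmv(ptr, ind, val, col) for col in X]

# A = diag(1, 2) in CSR; X has columns [1, 1] and [2, 3]
ptr, ind, val = [0, 1, 2], [0, 1], [1.0, 2.0]
Y = csr_spmm_dense(ptr, ind, val, [[1, 1], [2, 3]])
```

The multi-column kernel mentioned above would instead fetch each row of A once and apply it to several columns of X, amortizing the matrix traffic across columns.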

@nathan

I have been playing around with the source code to understand it. One thing I have noticed is that some of the calls that I have commented out are still executing.

Say, for example, I comment out test_dia_matrix_kernels(csr) and test_csr_matrix_kernels(csr): these calls are executed even though I have commented them out. Can you tell me why this is happening?

I have recompiled and tried it a few times, but they are still getting executed.

EDIT: Nevermind. Some stupidity on my part. I was recompiling the same object files again and again.

Hi nbell,

I just wanted to make sure that your program is not suitable for calculating matrix-matrix multiplication in which both matrices are sparse.

I will contact you to ask for more information.

Thank you very much for your consideration.

I would be very interested in the sparse matrix - sparse matrix multiplication operator, too. nbell, would you mind posting some information about it here? I also wanted to ask whether there are any plans to include this operation in the native CUDA/CUBLAS library in a future SDK?

Regards, Janick