I have code that builds matrices that look like this: http://www.cs.berkeley.edu/~demmel/cs267/lecture17/Linearize2DHeat.gif (which cuBLAS and cuSPARSE then solve)
The problem I have is that the size of these matrices blows up very, very fast with mesh size. A 100x100 mesh has ~10,000 unknowns, so it generates a matrix like in the image above with about 100 million elements (~750 MB of doubles), most of them being 0. This is a huge waste of memory, as well as of GPU threads that do nothing when the matrix is built where the value is 0. cuSPARSE can convert dense to CSR, but I cannot do that when the dense matrix becomes too large to fit in GPU memory in the first place, and a 100x100 mesh is small. To show how bad this problem is: going to a 1000x1000 mesh would require roughly 992 billion elements in dense format (several TB, with probably 99.9% of them being zeros).
The solution, I believe, is to build the matrices directly in COO or CSR format. Doing that in serial code would be trivial, but probably slow. It's the building of COO or CSR in parallel that is throwing me off, because I have to associate every thread with an element ID in the COO or CSR arrays.
Been stuck for a little bit on this so any help would be greatly appreciated.
I would think that, even if you cannot process the matrix as a whole, you should be able to process sub-matrices, or sections of the matrix at a time.
I would consider working with sections of the matrix, each section containing one or multiple rows,
and I would consider a 2-step process, each step working on the matrix section by section.
In step 1, you sum-scan a predicate: 1 if the matrix element is non-zero, 0 otherwise; the sum-scan should accumulate across sections.
In step 2, you calculate the CSR or COO data for the section, and store it at the offsets given by the sum-scan of step 1.
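A serial C++ sketch of the two-step idea (on the GPU the scan becomes a parallel prefix sum, e.g. thrust::exclusive_scan, and step 2 is one thread per dense element; here the whole matrix is scanned in one go rather than section by section, just to keep it short):

```cpp
#include <vector>

struct Csr {
    std::vector<int>    row_ptr, col_ind;
    std::vector<double> val;
};

// Step 1: exclusive sum-scan of a 0/1 predicate (nonzero or not) over the
// dense matrix `a` (row-major, n x n) gives each nonzero its output slot.
// Step 2: every element with a true predicate writes (col, value) into its
// own slot -- no coordination between "threads" is needed.
Csr dense_to_csr(const std::vector<double>& a, int n) {
    std::vector<int> slot(n * n + 1, 0);
    for (int k = 0; k < n * n; ++k)                 // step 1: the scan
        slot[k + 1] = slot[k] + (a[k] != 0.0 ? 1 : 0);

    Csr m;
    int nnz = slot[n * n];
    m.col_ind.resize(nnz);
    m.val.resize(nnz);
    m.row_ptr.resize(n + 1);
    for (int r = 0; r < n; ++r) {
        m.row_ptr[r] = slot[r * n];                 // first slot of row r
        for (int c = 0; c < n; ++c) {
            int k = r * n + c;
            if (a[k] != 0.0) {                      // step 2: scattered write
                m.col_ind[slot[k]] = c;
                m.val[slot[k]]     = a[k];
            }
        }
    }
    m.row_ptr[n] = nnz;
    return m;
}
```

To stay within GPU memory you would run both steps over one section of rows at a time, carrying the running scan total from one section to the next.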
I think I have it (almost) figured out, based on a predetermined pattern. I know beforehand which elements can possibly be non-zero. Those are given by the conditions:
I = J
I-1 = J
J-1 = I
I-offset = J
J-offset = I
offset is problem dependent but known at compile time. Then there are some further conditions that remove a few more of these (boundary conditions on the problem). What I'm doing right now is a vector push_back to build the COO matrix on the fly. I think this will end up working.
I still waste threads building these arrays, but once the initial pass is finished, the known I, J indices for COO (or row pointers for CSR) no longer change, and I just need to recalculate the values at each index instead of looping through the conditions listed above.
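If the pattern really is fixed by those conditions, the dense matrix can be skipped entirely: enumerate the candidate columns per row once, then only rewrite the value array on later passes. A serial C++ sketch (coeff() is a hypothetical stand-in for the problem-specific entry formula, and it assumes offset > 1 so the five candidates don't collide; boundary handling here is just the range check):

```cpp
#include <functional>
#include <vector>

struct Coo {
    std::vector<int>    row, col;
    std::vector<double> val;
};

// Build the COO pattern once from the stencil conditions
// (I = J, I-1 = J, J-1 = I, I-offset = J, J-offset = I).
Coo build_pattern(int n, int offset,
                  const std::function<double(int, int)>& coeff) {
    Coo m;
    for (int i = 0; i < n; ++i) {
        // The five candidate columns for row i, in ascending order.
        int cand[5] = {i - offset, i - 1, i, i + 1, i + offset};
        for (int j : cand) {
            if (j < 0 || j >= n) continue;  // out-of-range candidates dropped
            m.row.push_back(i);
            m.col.push_back(j);
            m.val.push_back(coeff(i, j));
        }
    }
    return m;
}

// Later passes: the I, J indices never change, so only the values
// are recomputed -- on the GPU this is one independent thread per nonzero.
void refresh_values(Coo& m, const std::function<double(int, int)>& coeff) {
    for (std::size_t k = 0; k < m.val.size(); ++k)
        m.val[k] = coeff(m.row[k], m.col[k]);
}
```

Since every row has at most 5 nonzeros, you could also skip the push_back and give each thread a fixed 5-slot region (compacting afterwards), which removes the serialization that a shared push introduces.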