I have code that builds matrices that look like this: http://www.cs.berkeley.edu/~demmel/cs267/lecture17/Linearize2DHeat.gif (which cuBLAS and cuSPARSE then solve)
The problem I have is that the size of these matrices blows up very, very fast with mesh size. A 100x100 mesh has ~10,000 unknowns, so it generates a matrix like in the image above with about 100 million elements (~750 MB of doubles), most of them being 0. This is a huge waste of memory, as well as of GPU threads that do nothing when the matrix is built where the value is 0. cuSPARSE can convert dense to CSR, but I cannot do that when the dense matrix becomes too large to fit in GPU memory in the first place, and a 100x100 mesh is small. To show how bad this problem is: going to a 1000x1000 mesh would require roughly 992 billion elements in dense format (several TB, with probably 99.9% of them being zeros).
The solution, I believe, is to build the matrices directly in COO or CSR format. Doing that in serial code would be trivial, but probably slow. It's the building of COO or CSR in parallel that is throwing me off, because I have to associate every thread with an element ID in the COO or CSR arrays.
Been stuck for a little bit on this so any help would be greatly appreciated.
I would think that, even if you cannot process the matrix as a whole, you should be able to process sub-matrices, or sections of the matrix at a time.
I would consider working with sections of the matrix, each section containing one or multiple rows,
and I would consider a 2-step process, each step working on the matrix section by section.
In step 1, you sum-scan a predicate: 1 if the matrix element is non-zero, 0 otherwise; the sum-scan should accumulate across sections.
In step 2, you calculate the CSR or COO data for the section, and store it at the offsets given by the sum-scan of step 1.
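A serial C++ sketch of the two-step idea (on the GPU the scan becomes a parallel prefix sum, e.g. thrust::exclusive_scan, and step 2 is one thread per dense element; here the whole matrix is scanned in one go rather than section by section, just to keep it short):

```cpp
#include <vector>

struct Csr {
    std::vector<int>    row_ptr, col_ind;
    std::vector<double> val;
};

// Step 1: exclusive sum-scan of a 0/1 predicate (nonzero or not) over the
// dense matrix `a` (row-major, n x n) gives each nonzero its output slot.
// Step 2: every element with a true predicate writes (col, value) into its
// own slot -- no coordination between "threads" is needed.
Csr dense_to_csr(const std::vector<double>& a, int n) {
    std::vector<int> slot(n * n + 1, 0);
    for (int k = 0; k < n * n; ++k)                 // step 1: the scan
        slot[k + 1] = slot[k] + (a[k] != 0.0 ? 1 : 0);

    Csr m;
    int nnz = slot[n * n];
    m.col_ind.resize(nnz);
    m.val.resize(nnz);
    m.row_ptr.resize(n + 1);
    for (int r = 0; r < n; ++r) {
        m.row_ptr[r] = slot[r * n];                 // first slot of row r
        for (int c = 0; c < n; ++c) {
            int k = r * n + c;
            if (a[k] != 0.0) {                      // step 2: scattered write
                m.col_ind[slot[k]] = c;
                m.val[slot[k]]     = a[k];
            }
        }
    }
    m.row_ptr[n] = nnz;
    return m;
}
```

To stay within GPU memory you would run both steps over one section of rows at a time, carrying the running scan total from one section to the next.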
I think I have it (almost) figured out, based on a predetermined pattern. I know beforehand which elements can possibly be non-zero. Those are given by the conditions:
I = J
I-1 = J
J-1 = I
I-offset = J
J-offset = I
offset is problem dependent but known at compile time. Then there are some further conditions that remove a few more of these (boundary conditions on the problem). What I'm doing right now is a vector push_back to build the COO matrix on the fly. I think this will end up working.
I still waste threads building these arrays, but once the initial pass is finished, the known I, J indices for COO (or row pointers for CSR) no longer change, and I just need to recalculate the values at each index instead of looping through the conditions listed above.
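If the pattern really is fixed by those conditions, the dense matrix can be skipped entirely: enumerate the candidate columns per row once, then only rewrite the value array on later passes. A serial C++ sketch (coeff() is a hypothetical stand-in for the problem-specific entry formula, and it assumes offset > 1 so the five candidates don't collide; boundary handling here is just the range check):

```cpp
#include <functional>
#include <vector>

struct Coo {
    std::vector<int>    row, col;
    std::vector<double> val;
};

// Build the COO pattern once from the stencil conditions
// (I = J, I-1 = J, J-1 = I, I-offset = J, J-offset = I).
Coo build_pattern(int n, int offset,
                  const std::function<double(int, int)>& coeff) {
    Coo m;
    for (int i = 0; i < n; ++i) {
        // The five candidate columns for row i, in ascending order.
        int cand[5] = {i - offset, i - 1, i, i + 1, i + offset};
        for (int j : cand) {
            if (j < 0 || j >= n) continue;  // out-of-range candidates dropped
            m.row.push_back(i);
            m.col.push_back(j);
            m.val.push_back(coeff(i, j));
        }
    }
    return m;
}

// Later passes: the I, J indices never change, so only the values
// are recomputed -- on the GPU this is one independent thread per nonzero.
void refresh_values(Coo& m, const std::function<double(int, int)>& coeff) {
    for (std::size_t k = 0; k < m.val.size(); ++k)
        m.val[k] = coeff(m.row[k], m.col[k]);
}
```

Since every row has at most 5 nonzeros, you could also skip the push_back and give each thread a fixed 5-slot region (compacting afterwards), which removes the serialization that a shared push introduces.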