Massive Matrix Multiplication: a discussion of the best way to perform massive matrix multiplication

So let's assume I have a (1xN) matrix named A and another sparse (NxN) matrix named B.

Let's say our sparse matrix is represented as an array of arrays of integers. This is best illustrated visually:

0: 1
1: 1 2
2: 3
3: 3 4

The integers in this structure give the column positions (numbered 1..N) where a 1 can be found. So if N = 4, the above representation corresponds to the following matrix:

1 0 0 0
1 1 0 0
0 0 1 0
0 0 1 1

and I am required to multiply these matrices, C = A*B, a number of times, let's say 100, each time using the result for the next iteration, e.g.:

Iteration 1: C1 = A*B
Iteration 2: C2 = C1*B
Iteration 3: C3 = C2*B
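
To make the iteration concrete, here is a minimal CPU-only C sketch of the whole process using the array-of-arrays representation above (the values in A are placeholders; the column indices are the 1-based ones from the example):

#include <stdio.h>
#include <string.h>

#define N     4
#define ITERS 100

int main(void)
{
    /* sparse B as the array of arrays above: row i lists the (1-based)
       column positions that contain a 1 */
    int  row0[] = {1}, row1[] = {1, 2}, row2[] = {3}, row3[] = {3, 4};
    int *rows[N]  = {row0, row1, row2, row3};
    int  width[N] = {1, 2, 1, 2};

    double c[N] = {1, 1, 1, 1};   /* the 1xN vector A (placeholder values) */
    double next[N];

    for (int it = 0; it < ITERS; ++it) {
        memset(next, 0, sizeof next);
        /* C[j] = sum_i c[i]*B[i][j]; B[i][j] = 1 exactly when column j
           appears in row i, so scatter c[i] into each listed column */
        for (int i = 0; i < N; ++i)
            for (int k = 0; k < width[i]; ++k)
                next[rows[i][k] - 1] += c[i];   /* -1: columns are 1-based */
        memcpy(c, next, sizeof c);              /* result feeds next iteration */
    }
    for (int j = 0; j < N; ++j)
        printf("%g ", c[j]);
    printf("\n");
    return 0;
}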

Now let's assume that matrix B takes up 2 GB of main memory even when fully optimized and compressed, so it cannot fit entirely into CUDA global memory, and we therefore need a block-based strategy similar to the one discussed in the CUDA Developer Guide.

What would be the optimal data structure to maximize performance? Please share your opinion on how you would structure the data to optimize performance, and let's discuss the options.

Check out supernodal techniques.

At the risk of sounding arrogant and lazy, would you mind giving more information about supernodal techniques? I did a Google search and it seems to be a very widely used term :P

On fast S(GE)MM: http://www.cs.utk.edu/~dongarra/WEB-PAGES/cscads-libtune-09/talk21-volkov.pdf

I found the slides quite useful for optimizing code, but they are not quite what I'm looking for.

As you saw, my data structure in main memory consists of an array of arrays, and from what I read on the forums CUDA doesn't work well with such a structure because of coalescing. So how would you represent a block of this data structure in CUDA?

I first thought of serializing it into a single-dimensional array:

0: 1
1: 1 2
2: 3
3: 3 4

becomes

1, -1, 1, 2, -2, 3, -3, 3, 4, -4

where the -X entries mark the row boundaries. But after some thought, this layout wouldn't work well on CUDA, since it requires branching on every element, which is undesirable.
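
A rough kernel sketch (all names are mine, purely illustrative) shows the problem; every thread has to test for markers and then scan backwards just to learn which row it belongs to:

__global__ void scatter_sentinel(const int *buf, int len,
                                 const float *c, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= len) return;

    int v = buf[idx];
    if (v < 0) return;                 /* marker element: thread sits idle */

    /* walk back to the previous marker to learn which row we are in;
       -X closes row X-1, so entries after it belong to row X */
    int i = idx;
    while (--i >= 0 && buf[i] >= 0)
        ;
    int row = (i >= 0) ? -buf[i] : 0;

    atomicAdd(&out[v - 1], c[row]);    /* columns are 1-based here */
}

The idle marker threads and the variable-length backward scan are exactly the divergence I am worried about.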

Then I thought of padding the rows with 0, but since the largest row contains over a million entries, that is also unfeasible: besides wasting memory space, I would be wasting computational effort having threads multiply by the 0 filler.

1, 0, 0, 0, 1, 2, 0, 0, 3, 0, 0, 0, 3, 4, 0, 0
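
For completeness, a sketch of what the padded kernel might look like (illustrative names again), one thread per row with the rows stored interleaved so neighbouring threads read neighbouring addresses:

__global__ void spmv_padded(const int *pad, int maxWidth, int n,
                            const float *c, float *out)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    for (int j = 0; j < maxWidth; ++j) {
        int col = pad[(size_t)j * n + row];  /* interleaved (column-major) layout */
        if (col != 0)                        /* 0 is filler; columns are 1-based */
            atomicAdd(&out[col - 1], c[row]);
    }
}

The interleaved layout would coalesce nicely, but with the widest row holding over a million entries nearly every loop iteration lands on filler, which is exactly the waste I want to avoid.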

Any ideas?

Just what I could think of off the top of my head…

I would keep an array in the constant memory space that tracks "how wide each row is" (only valid for N < 16384, since constant memory is 64 KB).

/*
0: 1
1: 1 2
2: 3
3: 3 4

Yields
*/

[1 2 1 2 ....]

I would allocate everything linearly in global memory, but without any padding it is going to be difficult to make things coalesce.
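
Roughly like this (an untested sketch, illustrative names); if you replace the widths by their running sum, each row gets a start offset into one linear column array, which is essentially the standard CSR layout:

/* offsets = prefix sum of the row widths; row i owns
   cols[d_offset[i] .. d_offset[i+1]-1]. 16384 ints fill the 64 KB of
   constant memory, hence the N < 16384 limit; for N ~ 7,000,000 the
   offsets would have to live in global memory instead. */
__constant__ int d_offset[16384];

__global__ void spmv_offsets(const int *cols, const float *c,
                             float *out, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    /* scatter c[row] into every column listed for this row; note that
       the cols[] reads from neighbouring threads are generally scattered,
       which is the coalescing difficulty mentioned above */
    for (int k = d_offset[row]; k < d_offset[row + 1]; ++k)
        atomicAdd(&out[cols[k] - 1], c[row]);   /* columns 1-based */
}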

Honestly, I don't have any good ideas for this right now :)

As I said, the solution needs to handle an N of roughly 7,000,000,

but what can be done is to partition into blocks. For example, take

A (NxN sparse matrix, columns numbered 0..N-1 here)
================
0: 1
1: 1 2
2: 3
3: 0 2
================

B (Nx1 dense vector)
================
2
3
4
5
================

Partitioning A into blocks of size 2 would give:

Row block 1

Block A1.1
================
adjacency range 0-1
0: 1
1: 1
================

Block A1.2
================
adjacency range 2-3
0:
1: 2
================

Row block 2

Block A2.1
================
adjacency range 0-1
2:
3: 0
================

Block A2.2
================
adjacency range 2-3
2: 3
3: 2
================


Partitioning B into blocks of size 2:

Block B1
================
2
3
================

Block B2
================
4
5
================

Multiplying

A1.1*B1 + A1.2*B2 (giving rows 0-1 of the result)
A2.1*B1 + A2.2*B2 (giving rows 2-3 of the result)

would give the same result. But can anyone think of any other way this can be done without padding? Padding involves a lot of wasted memory and computation multiplying 0s, which is unwanted.
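
For reference, the host-side driver I have in mind looks roughly like this (every name here is hypothetical, and spmv_block stands for whatever per-block kernel gets written); only one compressed sub-block needs to sit in global memory at a time:

/* hypothetical per-block kernel: accumulates A(r,c) * B-slice into y;
   the body depends entirely on the chosen compressed block format */
__global__ void spmv_block(const int *blk, const float *x, float *y, int n)
{
    /* ... decode blk and accumulate its rows times x into y ... */
}

void blocked_multiply(int numRowBlocks, int numColBlocks, int blockN,
                      int **h_block, size_t *h_blockBytes,
                      int *d_block, float *d_B, float *d_y, float *h_y)
{
    for (int r = 0; r < numRowBlocks; ++r) {
        cudaMemset(d_y, 0, blockN * sizeof(float));
        for (int c = 0; c < numColBlocks; ++c) {
            int b = r * numColBlocks + c;
            /* upload the compressed (r,c) sub-block; only this piece
               of the 2 GB matrix must fit in global memory at once */
            cudaMemcpy(d_block, h_block[b], h_blockBytes[b],
                       cudaMemcpyHostToDevice);
            /* d_y += A(r,c) * B(c), i.e. one term of the sums above */
            spmv_block<<<(blockN + 255) / 256, 256>>>(
                d_block, d_B + (size_t)c * blockN, d_y, blockN);
        }
        /* row block r of the result is complete; bring it back */
        cudaMemcpy(h_y + (size_t)r * blockN, d_y, blockN * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
}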

Or if anyone can think of something outside the box, please do; I'm quite new to CUDA, so you're more than welcome to.

I am not sure I see what the challenge is. There are already some high-performance sparse matrix multiplication kernels available here which can compute the multiply, and partitioning tools like METIS that can partition the matrix for you. Once you have the matrix in block form, just compute the sub-matrix products and sum/assemble them on the host into the final product vector. You probably won't have to write a line of device code to do it. Or am I missing something?

Thanks for the information, very useful. I'll read the documentation and get back to you ASAP.

knq,
If you are trying to do SpMV, I think the link avidday gave is the best work from NVIDIA. SpMV is commonly used in iterative solvers…

Are there any code examples or better explanations of how to use the libraries? All I found are papers on how they work algorithmically.