Massive Matrix Multiplication: a discussion of the best way to perform massive matrix multiplication

So let's assume I have a (1xN) matrix named A and another sparse (NxN) matrix named B.

Let's say our sparse matrix is represented as an array of arrays of integers. This is best illustrated visually:

0: 1
1: 1 2
2: 3
3: 3 4

The integers in this structure give the column positions (numbered 1..N) where a 1 can be found. So if N = 4, the above representation corresponds to the following matrix:

1 0 0 0
1 1 0 0
0 0 1 0
0 0 1 1

and I am required to multiply these matrices, C = A*B, a number of times, let's say 100, each time using the result for the next iteration, e.g.:

Iteration 1: C1 = A*B
Iteration 2: C2 = C1*B
Iteration 3: C3 = C2*B
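
To make the iteration concrete, here is a minimal CPU-only C sketch of the whole process using the array-of-arrays representation above (the values in A are placeholders; the column indices are the 1-based ones from the example):

#include <stdio.h>
#include <string.h>

#define N     4
#define ITERS 100

int main(void)
{
    /* sparse B as the array of arrays above: row i lists the (1-based)
       column positions that contain a 1 */
    int  row0[] = {1}, row1[] = {1, 2}, row2[] = {3}, row3[] = {3, 4};
    int *rows[N]  = {row0, row1, row2, row3};
    int  width[N] = {1, 2, 1, 2};

    double c[N] = {1, 1, 1, 1};   /* the 1xN vector A (placeholder values) */
    double next[N];

    for (int it = 0; it < ITERS; ++it) {
        memset(next, 0, sizeof next);
        /* C[j] = sum_i c[i]*B[i][j]; B[i][j] = 1 exactly when column j
           appears in row i, so scatter c[i] into each listed column */
        for (int i = 0; i < N; ++i)
            for (int k = 0; k < width[i]; ++k)
                next[rows[i][k] - 1] += c[i];   /* -1: columns are 1-based */
        memcpy(c, next, sizeof c);              /* result feeds next iteration */
    }
    for (int j = 0; j < N; ++j)
        printf("%g ", c[j]);
    printf("\n");
    return 0;
}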

Now let's assume that matrix B takes up 2 GB of main memory even when fully optimized and compressed, so it cannot fit entirely into CUDA global memory, and we therefore need a block-based strategy similar to the one discussed in the CUDA Developer Guide.

What would be the optimal data structure to maximize performance? Please share your opinion on how you would structure the data to optimize performance, and let's discuss the options.

Check out supernodal techniques.

At the risk of sounding arrogant and lazy, would you mind giving more information about supernodal techniques? I did a Google search and it seems to be a very widely used term :P

On fast S(GE)MM: http://www.cs.utk.edu/~dongarra/WEB-PAGES/cscads-libtune-09/talk21-volkov.pdf

I found the slides quite useful for optimizing code, but they are not quite what I'm looking for.

As you saw, my data structure in main memory consists of an array of arrays, and from what I read on the forums CUDA doesn't work well with such a structure because of coalescing. So how would you represent a block of this data structure in CUDA?

I first thought of serializing it into a single-dimensional array:

0: 1
1: 1 2
2: 3
3: 3 4

becomes

1, -1, 1, 2, -2, 3, -3, 3, 4, -4

where the -X entries mark the row boundaries. But after some thought, this layout wouldn't work well on CUDA, since it requires branching on every element, which is undesirable.
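
A rough kernel sketch (all names are mine, purely illustrative) shows the problem; every thread has to test for markers and then scan backwards just to learn which row it belongs to:

__global__ void scatter_sentinel(const int *buf, int len,
                                 const float *c, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= len) return;

    int v = buf[idx];
    if (v < 0) return;                 /* marker element: thread sits idle */

    /* walk back to the previous marker to learn which row we are in;
       -X closes row X-1, so entries after it belong to row X */
    int i = idx;
    while (--i >= 0 && buf[i] >= 0)
        ;
    int row = (i >= 0) ? -buf[i] : 0;

    atomicAdd(&out[v - 1], c[row]);    /* columns are 1-based here */
}

The idle marker threads and the variable-length backward scan are exactly the divergence I am worried about.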

Then I thought of padding the rows with 0, but since the largest row contains over a million entries, that is also unfeasible: besides wasting memory space, I would be wasting computational effort having threads multiply by the 0 filler.

1, 0, 0, 0, 1, 2, 0, 0, 3, 0, 0, 0, 3, 4, 0, 0
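
For completeness, a sketch of what the padded kernel might look like (illustrative names again), one thread per row with the rows stored interleaved so neighbouring threads read neighbouring addresses:

__global__ void spmv_padded(const int *pad, int maxWidth, int n,
                            const float *c, float *out)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    for (int j = 0; j < maxWidth; ++j) {
        int col = pad[(size_t)j * n + row];  /* interleaved (column-major) layout */
        if (col != 0)                        /* 0 is filler; columns are 1-based */
            atomicAdd(&out[col - 1], c[row]);
    }
}

The interleaved layout would coalesce nicely, but with the widest row holding over a million entries nearly every loop iteration lands on filler, which is exactly the waste I want to avoid.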

Any ideas?

Just what I could think of off the top of my head…

I would keep an array in the constant memory space that tracks "how wide each row is" (only valid for N < 16384, since constant memory is 64 KB).

/*
0: 1
1: 1 2
2: 3
3: 3 4

Yields
*/

[1 2 1 2 ....]

I would allocate everything linearly in global memory, but without any padding it is going to be difficult to make things coalesce.
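
Roughly like this (an untested sketch, illustrative names); if you replace the widths by their running sum, each row gets a start offset into one linear column array, which is essentially the standard CSR layout:

/* offsets = prefix sum of the row widths; row i owns
   cols[d_offset[i] .. d_offset[i+1]-1]. 16384 ints fill the 64 KB of
   constant memory, hence the N < 16384 limit; for N ~ 7,000,000 the
   offsets would have to live in global memory instead. */
__constant__ int d_offset[16384];

__global__ void spmv_offsets(const int *cols, const float *c,
                             float *out, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    /* scatter c[row] into every column listed for this row; note that
       the cols[] reads from neighbouring threads are generally scattered,
       which is the coalescing difficulty mentioned above */
    for (int k = d_offset[row]; k < d_offset[row + 1]; ++k)
        atomicAdd(&out[cols[k] - 1], c[row]);   /* columns 1-based */
}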

Honestly, I don't have any good ideas for this right now :)

As I said, the solution needs to handle an N of roughly 7,000,000,

but what can be done is to partition into blocks. For example, take

A (NxN sparse matrix, columns numbered 0..N-1 here)
================
0: 1
1: 1 2
2: 3
3: 0 2
================

B (Nx1 dense vector)
================
2
3
4
5
================

Partitioning A into blocks of size 2 would give:

Row block 1

Block A1.1
================
adjacency range 0-1
0: 1
1: 1
================

Block A1.2
================
adjacency range 2-3
0:
1: 2
================

Row block 2

Block A2.1
================
adjacency range 0-1
2:
3: 0
================

Block A2.2
================
adjacency range 2-3
2: 3
3: 2
================


Partitioning B into blocks of size 2:

Block B1
================
2
3
================

Block B2
================
4
5
================

Multiplying

A1.1*B1 + A1.2*B2 (giving rows 0-1 of the result)
A2.1*B1 + A2.2*B2 (giving rows 2-3 of the result)

would give the same result. But can anyone think of any other way this can be done without padding? Padding involves a lot of wasted memory and computation multiplying 0s, which is unwanted.
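
For reference, the host-side driver I have in mind looks roughly like this (every name here is hypothetical, and spmv_block stands for whatever per-block kernel gets written); only one compressed sub-block needs to sit in global memory at a time:

/* hypothetical per-block kernel: accumulates A(r,c) * B-slice into y;
   the body depends entirely on the chosen compressed block format */
__global__ void spmv_block(const int *blk, const float *x, float *y, int n)
{
    /* ... decode blk and accumulate its rows times x into y ... */
}

void blocked_multiply(int numRowBlocks, int numColBlocks, int blockN,
                      int **h_block, size_t *h_blockBytes,
                      int *d_block, float *d_B, float *d_y, float *h_y)
{
    for (int r = 0; r < numRowBlocks; ++r) {
        cudaMemset(d_y, 0, blockN * sizeof(float));
        for (int c = 0; c < numColBlocks; ++c) {
            int b = r * numColBlocks + c;
            /* upload the compressed (r,c) sub-block; only this piece
               of the 2 GB matrix must fit in global memory at once */
            cudaMemcpy(d_block, h_block[b], h_blockBytes[b],
                       cudaMemcpyHostToDevice);
            /* d_y += A(r,c) * B(c), i.e. one term of the sums above */
            spmv_block<<<(blockN + 255) / 256, 256>>>(
                d_block, d_B + (size_t)c * blockN, d_y, blockN);
        }
        /* row block r of the result is complete; bring it back */
        cudaMemcpy(h_y + (size_t)r * blockN, d_y, blockN * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
}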

Or if anyone can think of something outside the box, please do; I'm quite new to CUDA, so you're more than welcome to.

I am not sure I see what the challenge is. There are already some high-performance sparse matrix multiplication kernels available here which can compute the multiply, and partitioning tools like METIS that can partition the matrix for you. Once you have the matrix in block form, just compute the sub-matrix products and sum/assemble them on the host into the final product vector. You probably won't have to write a line of device code to do it. Or am I missing something?

Thanks for the information, very useful. I'll read the documentation and get back to you ASAP.

knq,
If you are trying to do SpMV, I think the link avidday gave is the best work from NVIDIA. SpMV is commonly used in iterative solvers…

Are there any code examples or better explanations of how to use the libraries? All I found are papers on how they work algorithmically.