Matrix Partitioning how would you implement it? Sparse matrix partitioning

So we have a 6,000,000 x 6,000,000 matrix of which only 130,000,000 million are NNZ (none zeros) which means only 0.002167% of the matrix is non zeros.
The above matrix needs to be multiplied to a vector containing 6,000,000 entries all of which are non zero.

how would you partition and how would you represent Blocks inside this matrix in order to achieve matrix multiplication using CUDA without utilizing hard disk swapping
P.S System details 12GB main memory and GTX260 graphics card

They are represented on disk in this manner

1: 2 3
3: 1 2 5
4: 5 6
5: 4 6
6: 4

were the initial value represent the Node ID and anything behind the “:” represents out links to other nodes the above data structure forms a sparse matrix
(P.S Node 2 contains no out links therefore it is omitted from the file)

P.S i already tried partitioning the matrix using Ellpack R as specified in this paper but the data structure was using more than the main memory i have therefore wasn’t a feasible solution.

please do express anything you might feel is relevant since i am honestly stuck and need urgent help

So if I am to understand correctly, you actually have three problems - the first being how to parse the structure you have into a space efficient sparse matrix, the second being how to partition that space efficient matrix to perform the multiplication in blocks that fit into the GPU memory, and the third being how to do the multiplication itself. Is that a fair synopsis?

the first being how to parse the structure you have into a space efficient sparse matrix - this i have done successfully and i am able to represent the whole sparse matrix on main memory using 1.2GB of main memory

the problem is that this structure saves the sparse matrix as a whole not as individual block.

second being how to partition that space efficient matrix to perform the multiplication in blocks that fit into the GPU memory - This is what i m having trouble with since i want to be able to partition the matrix prior the multiplication instates so that i can iterate though each block each time copying the block from main memory to gpu memory.

I have already done an algorithm where i partitioned the matrix into nxn where n is a constant BLOCK_SIZE and then i used Ellpack-R to represent the blocks containing atleast one non zero element but the thing takes more than 12gb of memory and my worst case scenario mathematics told me it should take 46GB which isn’t nice

where we have 130 million out links each one resides in it own individual block therefore it will contain an 3x array of Block size minimum each containing ints therefore 130,000,000x3xBLOCK_SIZEx4bytes where block size = 32 it would take roughly 46GB which would require i/o disk swapping

third being how to do the multiplication itself - The multiplication i can deduce how it can need to be done alone once i have figured out an appropriate blocking structure which will globally fit into main memory and where individual block will fit on GPU memory

Ok, so what format are you using?

This is the bit I found confusing in your other thread on this same subject as well. The “naïve” scheme would be to use a column wise decomposition (after all the matrix-vector product is simply the inner product of every row of the matrix with the LHS vector). So a column wise decomposition would just take some rows of the sparse matrix, copy them to the GPU, do the multiply using one of the already available sparse matrix-vector multiply kernels, and store the results in the appropriate rows of the RHS vector. Just repeat that as many times as necessary until you have done the whole matrix. Common row-wise sparse formats like CSR make it pretty easy to extract a subset of rows from a sparse matrix.

There are more sophisticated ways to do this, but surely the simplest scheme isn’t all that complex?

i am using a similar format to the one the text file is using i have an array for every row in the matrix and each row contains another array containing the columns where a link exists.

Row 1. link to column 4, link to column 5, …

Row 2. …

Row 3 …

I m not 100% sure if i understood you correctly but if i did what you meant is you multiply the dense vector with each individual row in the matrix right? if that is so your assuming that the vector would fit into global memory or the row vector would fit into memory. although the matrix is a sparse matrix their is one row which contains an excess of 50 000 nnz elements inside of it, which i m not sure will with in GPU memory

No. Just take a set of sparse rows and multiply them with the LHS vector using a sparse matrix-vector kernel. That assumes that at least single sparse row would fit into the GPU memory along with two copies of the vector (LHS and RHS). If that won’t work, use a graph based partitioner like metis. The sparse multiply kernels I already gave you a link for already have demo code with an interface to metis.

So let me see if i understood you correctly by an example

Vector (1 2 3 4 5 6)


Sparse Matrix Representation

1: 2 3

3: 1 2 5

4: 5 6

5: 4 6

6: 4


Next Vecotr (1 2 3 4 5 6)

multiply ( 1 2 3) with

1: 2 3

3: 1 2 5

giving you

Next Vector’ (1 2 3 4 5 6)


multiply (4 5 6) with

4: 5 6

5: 4 6

6: 4

giving you

Next Vector’’ (1 2 3 4 5 6)

Next Vector = Next Vector’ + Next Vecotr’’

is this what you mean? (By the way thanks for you patience and sorry for my english :D)

Regarding metis i am unable to build it using windows if anyone has any good tutorial on how to build metis on windows please do tell

No, that isn’t what I meant.

For Ax=b, with A sparse, decompose the columns of A into a set of row-wise stacked sub matrices (in this example A1, A2, A3 and A4, each of which is a sparse matrix containing whole rows of the parent matrix A), and execute:

| A1 |		| A1 * x |

| A2 | * x =  | A2 * x |

| A3 |		| A3 * x |

| A4 |		| A4 * x |

as four sparse matrix-vector multiplication operations.

Understood and that what i was trying to explain in my example my communication skills need some work :P

regarding building metis on windows

i found this in the make file

anyone know what is meant by make sure that ‘cl’ is on the path?

‘cl’ is the name of the compiler in Visual Studio, just like GCC is ‘gcc’ for C or ‘g++’ for C++. Visual Studio comes with a special ‘Visual Studio XXX Command Prompt’ (where XXX is the version of VS) shortcut that should do this for you.

I haven’t been able to get the Metis source to build using the windows tools; I did manage to get it build in cygwin. I have a feeling none of the developers regularly works in windows, since there are unix-specific function calls that have no widows equivalent. Making it build in MinGW (i.e. using WIN32 defines without using microsoft’s tools) has been slow and painful.

I gave up using metis on windows and have been working on my own implementation and i hit a problem maybe someone would be kind enough to shed some light on the problem.

So below you will find my code example which does matrix multiplication of a vector held in global memory named device_inputVector with a matrix in a hybrid TJDs format given below

values [float] -> 0.5, 0.5, 0.5, … //Containing the values inside the matrix

rowIDs [int] -> 0, 1, 2, … //Contains the rowID of the element to be multiplied with

resultOrder [int] -> 5, 3, 2, … //Contains the index of where the result should be placed

All array have an equal length and the index of each array represents one value

and places the result in device_outputVector. Size is an integer indicating the number of elements in the arrays

__global__ void 

blockMultiply( float * device_inputVector, float * device_outputVector, float * device_values,

					 int * device_rowIDs, int * device_resultOrder, int size)


	__shared__ float shared_values[THREAD_PER_BLOCK];

	__shared__ int shared_rowIDs[THREAD_PER_BLOCK];

	__shared__ int shared_resultOrder[THREAD_PER_BLOCK];



	const int thread_id   = blockDim.x * blockIdx.x + threadIdx.x;				   // global thread index


	if (thread_id > (size - 1))



	shared_values[threadIdx.x] = device_values[thread_id];

	shared_rowIDs[threadIdx.x] = device_rowIDs[thread_id];

	shared_resultOrder[threadIdx.x] = device_resultOrder[thread_id];	


	// Block until all threads in the block have 

	// written their data to shared mem



	float tempResult = device_inputVector[(shared_rowIDs[threadIdx.x] - 1)] * shared_values[threadIdx.x];


	device_outputVector[shared_resultOrder[threadIdx.x]] += tempResult;


The problem is when i run this code in emulation mode it gives the correct results but when i run it on the actual CUDA device the results are incorrect. (Also if anyone notices any performances changes i can make in order to make the code above more efficient let me know please

Thanks :D

I think the problem is related to

device_outputVector[shared_resultOrder[threadIdx.x]] += tempResult;

since while thread 1 and thread 2 are doing the same operation at the same time due toe SIMD architecture they are overwriting their results is their any work around?

A = device_outputVector[shared_resultOrder[threadIdx.x]];


device_outputVector[…] = A + tempResult;

Assuming that you dont need to synchronize among blocks… If that is also required then you need an algo revamp

Still causing the same error since what is happening i think is that for example

RowIDs -> 1, 3 ,5, 6, 3 , 1

Values -> 0.5, 0.33, 0.5, 1, 0.33, 0.5

ResultOrder -> 1, 1, 3, 3, 0, 2

device_inputVector -> 1, 2, 3, 4, 5, 6

does are the elements contained in shared memory of lets say Block A so what is happening is the following

Block A create a wark which runs 32 threads in Single instruction multiple Data

Warp 1 - > Thread 0

it mutiplies device_inputVector[RowIDs[0]] with Values[0] and placing the result in device_outputVector[ResultOrder[0]]

which get translated to (0.5X2) += device_outputVector[1]

Warp 1 -> Thread 1

does the same thing except instead of thread id being 0 it is 1 but the result is still outputed to the same device_outputVector[1] therefore due to SIMD architecture the results are being overwritten

Dont know is this what is actually happening? and what can be done in order to work around this?

lets say this gets translated to

Thread 1

device_outputVector[0] = A + tempResult;

Thread 2

device_outputVector[0] = A + tempResult;

Thread 3

wouldn’t the results still over write each other or am i missing something?

I thought the sharedResultOrder was an unique permutation. If it is not a unique permutation, you just cant guarantee race-free execution.