# Sparse Matrix-Matrix Multiplication

Hi,

I’m really new to CUDA, so please bear with me if I’m not at the same pace as some posters.

I’m implementing a statistical text analysis program and have it running nicely on C/OpenMP. However, my matrix-matrix multiplications are taking far too long (~20 seconds), resulting in my entire program taking about a month to compute.

I’ve got this pseudo-code that I want to compute in my C program.

D=1./(1+exp(0 -A*B - C)); EQN 1

A is 1,000x10,000 (a highly sparse document-count vector)

B is 10,000x5,000

C is 1*5,000

D is 10,00x5,000

Now I wrote an O(N^3) function using dense matrices to compute this:

[codebox]

void calc_activations(float *A,float *B,float *C,float *D,long batch_len,int num_hid,long num_dims)

{

``````    /////

//batch_len=1000, num_hid=5000, num_dims=10000

/////

double summation,activation;

int i,j,k;

#pragma omp parallel for private(i,j,k,summation,activation)

for(i=0;i<(int)batch_len;i++)

{

for(j=0;j<(int)num_hid;j++)

{

summation=0.0;

for(k=0;k<(int)num_dims;k++)

{

if(A[(int)(i*num_dims)+k]!=0)   //small efficiency

summation=summation+A[(int)(i*num_dims)+k]*B[k*num_hid+j];

}

activation=f_logistic(0-summation-C[j]);   //f_logistic simply returns 1/1+exp(x)

D[i*num_hid+j]=activation;

}

}
``````

}

[/codebox]

As you can see, I’m computing everything at one go. I would prefer not to compute EQN 1 in stages, i.e.

• compute A*B

• compute D=0-A*B-C

• compute D=1./(1+D)

I’m looking for an efficient sparse matrix-matrix multiplication function that I can modify slightly to do all the computation in one go as in the above function. Is this too much to ask???

I’d love to get my hands on a sparse version of the BLAS standard function SGEMM. I.e. C = alphaAB + beta*C, where alpha=-1.0 and beta=-1.0 and A is represented as a sparse matrix in CSR format!!! B is dense. I would then modify it slightly to also compute the 1/1+x operation.

Any insight is most appreciated.

?

I think that you do not really mean the exponential of a Matrix as a power series but just the exponential function on its elements.

There is a sparce matrix to vector multiplication one can simply regard a dense matrix multiplication with a sparce matrix as a series of vector and sparce matrix multiplication so:

http://www.nvidia.com/object/nvidia_research_pub_001.html

It has also the code.

Also in activation=f_logistic(0-summation-C[j]); do you really mean this or do you mean : -summation-C[i*num_hid+j] why do you use the zero it is a useless addition.

One nice exercise can be to feed the whole matrix to the kernel and do it inide the kernel.

There are two solutions here on this:

One is to do a for loop inside the kernel or unfold the matrix in a vector and do in parallel all the calculations. The second is fast the first is slow. You should see a huge speedup of less than a second in your dimensions.

Hope that helps.

Alex.

Yes, I mean performing the exponential on each element in the matrix.

Yes, I was thinking of performing the matrix-vector operation multiple times.

I wrote 0-X, but I guess what I really mean is to just negate the X matrix…

I’m looking at modifying the standard NVIDIA matrix multiplier function for now (it is not sparse).

[codebox]

/*

• NVIDIA Corporation and its licensors retain all intellectual property and

• proprietary rights in and to this software and related documentation.

• Any use, reproduction, disclosure, or distribution of this software

• and related documentation without an express license agreement from

• NVIDIA Corporation is strictly prohibited.

• Please refer to the applicable NVIDIA end user license agreement (EULA)

• associated with this source code for terms and conditions that govern

• your use of this NVIDIA software.

*/

/* Matrix multiplication: C = A * B.

• Device code.

*/

#ifndef MATRIXMUL_KERNEL_H

#define MATRIXMUL_KERNEL_H

#include <stdio.h>

#include “matrixMul.h”

#define CHECK_BANK_CONFLICTS 0

#if CHECK_BANK_CONFLICTS

#define AS(i, j) cutilBankChecker(((float*)&As[0][0]), (BLOCK_SIZE * i + j))

#define BS(i, j) cutilBankChecker(((float*)&Bs[0][0]), (BLOCK_SIZE * i + j))

#else

#define AS(i, j) As[i][j]

#define BS(i, j) Bs[i][j]

#endif

////////////////////////////////////////////////////////////////////////////////

//! Matrix multiplication on the device: C = A * B

//! wA is A’s width and wB is B’s width

////////////////////////////////////////////////////////////////////////////////

global void

matrixMul( float* C, float* A, float* B, int wA, int wB)

{

``````// Block index

int bx = blockIdx.x;

int by = blockIdx.y;
``````

``````int tx = threadIdx.x;

``````

// Index of the first sub-matrix of A processed by the block

``````int aBegin = wA * BLOCK_SIZE * by;
``````

// Index of the last sub-matrix of A processed by the block

``````int aEnd   = aBegin + wA - 1;
``````

// Step size used to iterate through the sub-matrices of A

``````int aStep  = BLOCK_SIZE;
``````

// Index of the first sub-matrix of B processed by the block

``````int bBegin = BLOCK_SIZE * bx;
``````

// Step size used to iterate through the sub-matrices of B

``````int bStep  = BLOCK_SIZE * wB;
``````

// Csub is used to store the element of the block sub-matrix

``````// that is computed by the thread

float Csub = 0;
``````

// Loop over all the sub-matrices of A and B

``````// required to compute the block sub-matrix

for (int a = aBegin, b = bBegin;

a <= aEnd;

a += aStep, b += bStep) {
``````

// Declaration of the shared memory array As used to

``````    // store the sub-matrix of A

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
``````

// Declaration of the shared memory array Bs used to

``````    // store the sub-matrix of B

__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
``````

// Load the matrices from device memory

``````    // to shared memory; each thread loads

// one element of each matrix

AS(ty, tx) = A[a + wA * ty + tx];

BS(ty, tx) = B[b + wB * ty + tx];
``````

// Synchronize to make sure the matrices are loaded

``````    __syncthreads();
``````

// Multiply the two matrices together;

``````    // each thread computes one element

// of the block sub-matrix

for (int k = 0; k < BLOCK_SIZE; ++k)

Csub += AS(ty, k) * BS(k, tx);
``````

// Synchronize to make sure that the preceding

``````    // computation is done before loading two new

// sub-matrices of A and B in the next iteration

}
``````

// Write the block sub-matrix to device memory;

``````// each thread writes one element

int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;

C[c + wB * ty + tx] = Csub;
``````

}

#endif // #ifndef MATRIXMUL_KERNEL_H

int main(int argc,char *argv)

{

``````return 0;
``````

}

[/codebox]

with particular attention to this line:

Csub += AS(ty, k) * BS(k, tx);

Not to mention all the other little bits of code to set up the blocks correctly.

Here is the above .cu’s header file:

[codebox]

/*

• NVIDIA Corporation and its licensors retain all intellectual property and

• proprietary rights in and to this software and related documentation.

• Any use, reproduction, disclosure, or distribution of this software

• and related documentation without an express license agreement from

• NVIDIA Corporation is strictly prohibited.

• Please refer to the applicable NVIDIA end user license agreement (EULA)

• associated with this source code for terms and conditions that govern

• your use of this NVIDIA software.

*/

#ifndef MATRIXMUL_H

#define MATRIXMUL_H

#define BLOCK_SIZE 16

// Basic Matrix dimensions (can be amplified by command line switch)

// (chosen as multiples of the thread block size for simplicity)

#define WA (5 * BLOCK_SIZE) // Matrix A width

#define HA (10 * BLOCK_SIZE) // Matrix A height

#define WB (5 * BLOCK_SIZE) // Matrix B width

#define HB WA // Matrix B height

#define WC WB // Matrix C width

#define HC HA // Matrix C height

#endif // MATRIXMUL_H

[/codebox]

I’ve got it all compiling now. I guess I just need to modify the .cu file to accomodate a C matrix, negate a few variables and incorporate the logistic operation. If I can get it working in dense mode, I expect to see a considerable speed improvement. TBQH, I’m not sure if I’ll even bother with the sparse version if I can get my compute time down from a month to a day, then again, if I can get another 20%, it’d be worth it!

Most definitely. Thanks so much for your insight.

No problem. It would be useful for us to see the time difference on both cases not to mention this could enrich a report that you can make with three diagrams. One for openMP, one for standard block based mult as you mention and one for sparce matrix multiplication.

Alex.

how many nonzero elements do A have?
sparse-vector multiplication depends on sparsity pattern.

A is 1,000x10,000 (a highly sparse document-count vector)
B is 10,000x5,000
C is 1*5,000

if you use dense SGEMM, thne SGEMM can reach 300 Gflop/s.

total time of C = AB + C is 2(1000 * 10000 * 5000)/300 ~ 3 second.

you can write a seperate kernel to compute D = 1./exp(-(A*B+C)).

I think that you have 6x times speedup if using dense sgemm.

I think with a GT200 it gets a little higher than 300. You are referring of course to cublasSgemm() of the CUBLAS library of CUDA.

Hi,

Thanks for that. A is quite sparse, in that each row only has about 50 to 150 non-zero elements.

Do you reckon I’ll only get an x6 speed-up? I’m gonna be hitting the code first thing in the morning. I’ve been taking it easy this weekend!

Ok, I’ll post some OpenMP/GPU, sparse/dense reports as soon as I can.

I’ve only got a GTX 8800/Quad core under my desk, though I do have access to a cluster with Fermi capability. I don’t want to look like a dork in front of the sys admin, so I’m gonna hold off for another week or two…

Exactly good thinking do not waste resources when your code is not ready and you can do it at your home. :-)

Remember that your card has just 16SPs opposed to the 30SPs of a GT200. So do not expect this speedup. Expect sth like a 3x. With the Fermi case it should be quite fast I would say it can exceed halp TF but do it when you are ready since you do not own the resource.

Best,

Alex.

A is 1,000x10,000 (a highly sparse document-count vector)

B is 10,000x5,000

C is 1*5,000

O.K. I think that Bell’s work (sparse matrix-vector product) is good for this matrix A.

Let us estimate the performance on your case.

Suppose performance of SpMV is 15Gflop/s (unstructured grid on single precision, see Bell’s paper).

non-zero elements of A is less than 1000 x 150

one SpMV needs 2 * (1000 x 150) / 15 Gflop/s = 2 E-5 second

you need to do 5000 SpMV.

total time of SpMV is 5000 x 2E-5 = 0.1 second

(kernel launch overhead ~ 10 micro second, 5000 kernel launch needs 5ms)

So you can try SpMV.

A good exercise is to do all 5000 in parallel unpack the matrix into a single vector and do everything on the kernel side instead of 5000 kernel invocations on the host, I would put half the grade on the second case. :-)

Update:

A 1,000x10,000 * 10,000x5,000 dense C=A*B operation takes:

~230 seconds on one core using a bog-standard O(N^3) multiplication

~29 seconds on eight cores @2.5 GHz (OpenMP)

~.15 seconds on the GPU (including memory transfer) (GT 8800)

I’m working on the D=1/(1+e^(0-A*B+C)) kernel now - UPDATE: this kernel takes 0.17 seconds (including memory transfer). Here is the code:

``````__global__ void calcActivationsKernel(Matrix A,Matrix B,Matrix C,Matrix D)

{

// Block row and column

int blockRow=blockIdx.y;

int blockCol=blockIdx.x;

// Each thread block computes one sub-matrix Dsub of D

Matrix Dsub=GetSubMatrix(D,blockRow,blockCol);

// Each thread computes one element of Csub

// by accumulating results into Cvalue

float Dvalue=0;

// Thread row and column within Csub

// Loop over all the sub-matrices of A and B that are

// required to compute Csub

// Multiply each pair of sub-matrices together

// and accumulate the results

for(int m=0;m<(A.w/BLOCK_SIZE);++m)

{

// Get sub-matrix Asub of A

Matrix Asub=GetSubMatrix(A,blockRow,m);

// Get sub-matrix Bsub of B

Matrix Bsub=GetSubMatrix(B,m,blockCol);

//Get sub-matrix Csub of C

Matrix Csub=GetSubMatrix(C,0,blockCol);

// Shared memory used to store Asub and Bsub respectively

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

__shared__ float Cs[0][BLOCK_SIZE];

// Load Asub and Bsub from device memory to shared memory

As[row][col]=GetElement(Asub,row,col);

Bs[row][col]=GetElement(Bsub,row,col);

// Synchronize to make sure the sub-matrices are loaded

// before starting the computation

// Multiply Asub and Bsub together

for (int e=0;e<BLOCK_SIZE;++e)

{

Dvalue-=As[row][e]*Bs[e][col]-Cs[0][col];

}

// Synchronize to make sure that the preceding

// sub-matrices of A and B in the next iteration

}

// Write Csub to device memory

// Each thread writes one element

SetElement(Dsub,row,col,1/(1+exp(Dvalue)));

}
``````

Here are the functions/structs I’m using:

``````typedef struct{

int w;

int h;

float* data;

}Matrix;

__device__ float GetElement(const Matrix A,int row,int col)

{

return A.data[row*A.w+col];

}

__device__ void SetElement(Matrix A,int row,int col,float value)

{

A.data[row*A.w+col]=value;

}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is

// located col sub-matrices to the right and row sub-matrices down

// from the upper-left corner of A

__device__ Matrix GetSubMatrix(Matrix A,int row,int col)

{

Matrix Asub;

Asub.w=BLOCK_SIZE;

Asub.h=BLOCK_SIZE;

Asub.data=&A.data[A.w*BLOCK_SIZE*row+BLOCK_SIZE*col];

return Asub;

}
``````

Although I’m not sure if this is right or not… I’ll have to ask the HPC guy tomorrow. Also, I should minimise unnecessary memory transfers…

Will edit this post when finished

I’ll then implement it all again using an A_sparse matrix and edit this post with the results.

Exactly it is as I expected less than a second. Great continue the nice work.

How do you sample elapse time of GPU kernel?

The timing is unreasonable.

If you consider host to device memory transfer, then

typical bandwidth of PCIE is 2GB/s.

size of A = 1.E7 elements

size of B = 5E7 elements

size of C = 5E6 elements

time of host <–> device = 6.5E7 * 4 byte / 2GB/s = 0.13 second

The execution time of your kernel C = A*B + C is very small?

If you have put cudaThreadSynchronize() after the kernel execution and you have verified it against the CPU code then your timing is correct, if not do it and post again your timing. LSChien is right and this timing is usually acquired acquired when you ommit the thread synchronization. With your card it should be though around 1-2sesc I believe with these dimensions. Your ptrogramming style is ok though I like it vety much. Just be careful with the timings.