I’m really new to CUDA, so please bear with me if I’m not at the same pace as some posters.
I’m implementing a statistical text analysis program and have it running nicely on C/OpenMP. However, my matrix-matrix multiplications are taking far too long (~20 seconds), resulting in my entire program taking about a month to compute.
I’ve got this pseudo-code that I want to compute in my C program.
D = 1./(1 + exp(0 - A*B - C));   (EQN 1)
A is 1,000x10,000 (a highly sparse document-count vector)
B is 10,000x5,000
C is 1x5,000
D is 1,000x5,000
Now I wrote an O(N^3) function using dense matrices to compute this:
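In outline it looks like this (just a sketch of the approach, not my exact listing; num_docs, num_vis and num_hid stand for the 1,000 / 10,000 / 5,000 dimensions above, and f_logistic is the elementwise 1/(1+e^x)):

#include <math.h>

// Elementwise logistic used in EQN 1: f_logistic(x) = 1/(1 + e^x)
static float f_logistic(float x)
{
    return 1.0f/(1.0f+expf(x));
}

// Dense O(N^3) version of EQN 1, fused into a single pass:
// D = 1./(1 + exp(0 - A*B - C)), with A (num_docs x num_vis),
// B (num_vis x num_hid), C (1 x num_hid), D (num_docs x num_hid)
void calcActivations(const float* A,const float* B,const float* C,float* D,
                     int num_docs,int num_vis,int num_hid)
{
    #pragma omp parallel for
    for(int i=0;i<num_docs;i++){
        for(int j=0;j<num_hid;j++){
            float summation=0.0f;
            for(int k=0;k<num_vis;k++)
                summation+=A[i*num_vis+k]*B[k*num_hid+j];
            // fuse the bias and the logistic into the same pass
            float activation=f_logistic(0-summation-C[j]);
            D[i*num_hid+j]=activation;
        }
    }
}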
As you can see, I’m computing everything at one go. I would prefer not to compute EQN 1 in stages, i.e.
compute A*B
compute D=0-A*B-C
compute D=1./(1+D)
I’m looking for an efficient sparse matrix-matrix multiplication function that I can modify slightly to do all the computation in one go, as in the above function. Is this too much to ask?
I’d love to get my hands on a sparse version of the BLAS standard function SGEMM, i.e. C = alpha*A*B + beta*C, where alpha = -1.0 and beta = -1.0, A is represented as a sparse matrix in CSR format, and B is dense. I would then modify it slightly to also compute the 1/(1+e^x) operation.
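Something along these lines is what I’m after (just a sketch with assumed names, not an existing library routine; rowPtr, colInd and val are the usual CSR arrays for A, and each thread produces one element of D):

__global__ void csrGemmLogistic(const int* rowPtr,const int* colInd,
                                const float* val,const float* B,
                                const float* C,float* D,int m,int n)
{
    // one thread per element of D (m x n); B and D dense row-major,
    // C a 1 x n bias row, A sparse in CSR
    int i=blockIdx.y*blockDim.y+threadIdx.y;   // row of D
    int j=blockIdx.x*blockDim.x+threadIdx.x;   // column of D
    if(i>=m||j>=n) return;
    float sum=0.0f;
    // walk the non-zeros of row i of A
    for(int p=rowPtr[i];p<rowPtr[i+1];++p)
        sum+=val[p]*B[colInd[p]*n+j];
    // alpha = beta = -1 inside the exponent, then the logistic
    D[i*n+j]=1.0f/(1.0f+expf(-sum-C[j]));
}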
I think that you do not really mean the exponential of a matrix as a power series, but just the exponential function applied elementwise.
There is a sparse matrix-vector multiplication available; one can simply regard the multiplication of a sparse matrix by a dense matrix as a series of sparse matrix-vector multiplications.
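For example (just a sketch with placeholder names; one thread per row of the sparse matrix, and B and D stored column-major so each column is contiguous):

__global__ void spmvCsr(const int* rowPtr,const int* colInd,const float* val,
                        const float* x,float* y,int numRows)
{
    // y = A_sparse * x, one thread per row of A
    int row=blockIdx.x*blockDim.x+threadIdx.x;
    if(row>=numRows) return;
    float sum=0.0f;
    for(int p=rowPtr[row];p<rowPtr[row+1];++p)
        sum+=val[p]*x[colInd[p]];
    y[row]=sum;
}

// Host side: one SpMV per column of B (10,000-long columns, 1,000-long results)
// for(int j=0;j<5000;++j)
//     spmvCsr<<<numBlocks,threadsPerBlock>>>(rowPtr,colInd,val,
//                                            B+j*10000,D+j*1000,1000);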
Also, in activation=f_logistic(0-summation-C[j]); do you really mean this, or do you mean -summation-C[i*num_hid+j]? And why do you use the zero? It is a useless addition.
One nice exercise would be to feed the whole matrix to the kernel and do everything inside the kernel.
There are two ways to do this:
One is to do a for loop inside the kernel; the other is to unfold the matrix into a vector and do all the calculations in parallel. The second is fast, the first is slow. You should see a huge speedup, down to less than a second at your dimensions.
I’ve got it all compiling now. I guess I just need to modify the .cu file to accommodate a C matrix, negate a few variables and incorporate the logistic operation. If I can get it working in dense mode, I expect to see a considerable speed improvement. TBQH, I’m not sure if I’ll even bother with the sparse version if I can get my compute time down from a month to a day; then again, if I can get another 20%, it’d be worth it!
No problem. It would be useful for us to see the time difference in both cases, not to mention this could enrich a report you could make with three diagrams: one for OpenMP, one for the standard block-based multiplication as you mention, and one for sparse matrix multiplication.
Ok, I’ll post some OpenMP/GPU, sparse/dense reports as soon as I can.
I’ve only got a GTX 8800/Quad core under my desk, though I do have access to a cluster with Fermi capability. I don’t want to look like a dork in front of the sys admin, so I’m gonna hold off for another week or two…
Exactly, good thinking: do not waste resources when your code is not ready and you can do it at home. :-)
Remember that your card has just 16 multiprocessors as opposed to the 30 of a GT200, so do not expect that speedup; expect something like 3x. With the Fermi card it should be quite fast, I would say it can exceed half a teraflop, but do that when you are ready, since you do not own the resource.
A good exercise is to do all 5,000 in parallel: unpack the matrix into a single vector and do everything on the kernel side instead of 5,000 kernel invocations on the host. I would put half the grade on the second case. :-)
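For example, with a kernel like the CSR one sketched earlier in the thread, the whole 1,000 x 5,000 output can be covered in a single launch (a sketch; the names are the assumed ones from that sketch):

// One launch covering every element of the 1,000 x 5,000 result,
// instead of 5,000 separate per-column launches from the host
dim3 block(16,16);
dim3 grid((5000+block.x-1)/block.x,(1000+block.y-1)/block.y);
csrGemmLogistic<<<grid,block>>>(rowPtr,colInd,val,d_B,d_C,d_D,1000,5000);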
A 1,000x10,000 * 10,000x5,000 dense C=A*B operation takes:
~230 seconds on one core using a bog-standard O(N^3) multiplication
~29 seconds on eight cores @2.5 GHz (OpenMP)
~0.15 seconds on the GPU (including memory transfer) (GT 8800)
I’m working on the D=1/(1+e^(0-A*B-C)) kernel now - UPDATE: this kernel takes 0.17 seconds (including memory transfer). Here is the code:
__global__ void calcActivationsKernel(Matrix A,Matrix B,Matrix C,Matrix D)
{
    // Block row and column
    int blockRow=blockIdx.y;
    int blockCol=blockIdx.x;
    // Each thread block computes one sub-matrix Dsub of D
    Matrix Dsub=GetSubMatrix(D,blockRow,blockCol);
    // Each thread computes one element of Dsub
    // by accumulating results into Dvalue
    float Dvalue=0.0f;
    // Thread row and column within Dsub
    int row=threadIdx.y;
    int col=threadIdx.x;
    // C is a 1 x D.w bias row, so each thread only needs the value for
    // its own column; read it once here instead of inside the loop
    Matrix Csub=GetSubMatrix(C,0,blockCol);
    float Cvalue=GetElement(Csub,0,col);
    // Loop over all the sub-matrices of A and B that are
    // required to compute Dsub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for(int m=0;m<(A.w/BLOCK_SIZE);++m)
    {
        // Get sub-matrix Asub of A
        Matrix Asub=GetSubMatrix(A,blockRow,m);
        // Get sub-matrix Bsub of B
        Matrix Bsub=GetSubMatrix(B,m,blockCol);
        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col]=GetElement(Asub,row,col);
        Bs[row][col]=GetElement(Bsub,row,col);
        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();
        // Multiply Asub and Bsub together, accumulating the negated
        // partial dot product (the exponent needs -A*B)
        for(int e=0;e<BLOCK_SIZE;++e)
        {
            Dvalue-=As[row][e]*Bs[e][col];
        }
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }
    // Write Dsub to device memory
    // Each thread writes one element: D = 1/(1 + exp(-A*B - C))
    SetElement(Dsub,row,col,1.0f/(1.0f+expf(Dvalue-Cvalue)));
}
Here are the functions/structs I’m using:
// Matrices are stored in row-major order; w is the width and is also
// used as the row stride when indexing (see GetSubMatrix below)
typedef struct{
    int w;
    int h;
    float* data;
}Matrix;
// Read the element at (row,col)
__device__ float GetElement(const Matrix A,int row,int col)
{
    return A.data[row*A.w+col];
}
// Write the element at (row,col)
__device__ void SetElement(Matrix A,int row,int col,float value)
{
    A.data[row*A.w+col]=value;
}
// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A,int row,int col)
{
    Matrix Asub;
    // Keep the parent's width as the row stride, so that GetElement and
    // SetElement still index into A's underlying storage correctly
    Asub.w=A.w;
    Asub.h=BLOCK_SIZE;
    Asub.data=&A.data[A.w*BLOCK_SIZE*row+BLOCK_SIZE*col];
    return Asub;
}
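For completeness, a matching host-side launch would look roughly like this (a sketch; d_A to d_D are assumed to be Matrix structs whose data fields already point to device memory). Note the kernel has no bounds checks, so every dimension has to be a multiple of BLOCK_SIZE; with 1,000x10,000 and 10,000x5,000, BLOCK_SIZE = 8 divides all of them evenly:

// grid.x runs over block columns of D, grid.y over block rows
dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
dim3 dimGrid(d_D.w/BLOCK_SIZE,d_D.h/BLOCK_SIZE);
calcActivationsKernel<<<dimGrid,dimBlock>>>(d_A,d_B,d_C,d_D);
cudaThreadSynchronize();   // wait for the kernel before timing or copying back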
Although I’m not sure if this is right or not… I’ll have to ask the HPC guy tomorrow. Also, I should minimise unnecessary memory transfers…
Will edit this post when finished
I’ll then implement it all again using an A_sparse matrix and edit this post with the results.
If you have put cudaThreadSynchronize() after the kernel execution and you have verified the result against the CPU code, then your timing is correct; if not, do that and post your timing again. LSChien is right, and a timing like this is usually obtained when you omit the thread synchronization. With your card it should be around 1-2 seconds, I believe, with these dimensions. Your programming style is OK, though, I like it very much. Just be careful with the timings.
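For reference, the pattern I mean is roughly this (a sketch; the launch and matrix names are the assumed ones used earlier in the thread):

#include <stdio.h>
#include <sys/time.h>

// simple wall-clock timer
static double wallTime(void)
{
    struct timeval tv;
    gettimeofday(&tv,NULL);
    return tv.tv_sec+tv.tv_usec*1e-6;
}

// ... around the launch:
double t0=wallTime();
calcActivationsKernel<<<dimGrid,dimBlock>>>(d_A,d_B,d_C,d_D);
cudaThreadSynchronize();   // without this the timer stops while the kernel is still running
double t1=wallTime();
printf("kernel time: %.4f s\n",t1-t0);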