performance question: sparse matrix-vector multiplication

I used an iterative method to solve a finite element problem.

In the iterative method we have to execute dot, saxpy, and sparse matrix-vector multiplication. The first two, dot and saxpy, have no performance problem, but the sparse matrix multiplication (q = A*p) does.

I configured 32 multiprocessors (each multiprocessor has 16 threads) to compute the elements of q (using more than 32 multiprocessors causes wrong results because of a data dependence).

More detail is included in the attached .txt files.

Can anybody help me? Many thanks.
parallel_code_.cu_.txt (2.17 KB)
sequential_code.txt (713 Bytes)

Hi. I basically have the same problem as you. I have implemented an initial version of the kernel, but as I am new to CUDA I only get 1.3 GFlops. I am working to improve on that, but I do not know exactly how. I probably have bank conflicts, but currently I do not know how to remove them. I am also thinking of using textures, at least for the vector, which should benefit from caching.

I paste the kernel code below. I am assuming the order of the matrix to be a multiple of 255. I chose this number because I am using 256 threads: I want to use them all to read the row indices at the beginning for all the lines processed by the block, and the number of entries needed from the row index vector is the number of lines + 1.

I am probably not the best one to give advice, but regarding your code:

  1. 32 seems too small a number of threads

  2. You are not using shared memory to coalesce your reads; you read directly from the global arrays all the time

I tried to coalesce my reads as much as I could … it hasn’t made much of a difference yet …

Regards,

Serban

#include <stdio.h>

#define MAX_ELEMENTS_PER_LINE 27

/* CSR sparse matrix-vector product: d_res = A * d_b.
   The indices in d_colIndex / d_rowIndex are 1-based (Fortran style),
   hence the "- 1" adjustments below. */
__global__ void spmvx_v1_kernel(const float *d_vals, const int *d_colIndex,
                                const int *d_rowIndex, const float *d_b,
                                float *d_res)
{
	int iSet;

	/* Get various data */
	const int tx = threadIdx.x;
	const int bx = blockIdx.x;
	const int numThreads = blockDim.x;
	const int numLinesToProcess = numThreads - 1;
	const int numLinesProcessedOnce = numThreads / MAX_ELEMENTS_PER_LINE;
	const int numSets = numLinesToProcess / numLinesProcessedOnce;

	/* Shared memory */
	__shared__ int   rowIndex[256];   // Read once for all lines of the block
	__shared__ int   colIndex[256];   // Just for a set of 256/27 = 9 lines
	__shared__ float vals[256];       // Just for a set
	__shared__ float b[256];          // Just for a set

	/* First row processed by the block */
	const int rowStart = numLinesToProcess * bx;

	/* Read the rowIndex data for all rows using all threads */
	rowIndex[tx] = d_rowIndex[rowStart + tx];
	__syncthreads();

	/* Process the full sets of lines (the remainder afterwards) */
	for (iSet = 0; iSet < numSets; iSet++)
	{
		// Locations of this set in the value/column arrays
		int setStart  = rowIndex[iSet * numLinesProcessedOnce] - 1;
		int setEnd    = rowIndex[(iSet + 1) * numLinesProcessedOnce] - 1;
		int setLength = setEnd - setStart;

		// Read the colIndex, vals and b for the whole set
		if (tx < setLength)
		{
			colIndex[tx] = d_colIndex[setStart + tx] - 1;
			vals[tx]     = d_vals[setStart + tx];  // values use the CSR element
			                                       // index, not the column index
			b[tx]        = d_b[colIndex[tx]];      // gather from the vector
		}
		__syncthreads();

		// Compute the product for all rows in the set and write it
		if (tx < numLinesProcessedOnce)
		{
			float sum = 0;
			int offset = iSet * numLinesProcessedOnce;
			for (int i = rowIndex[tx + offset] - 1; i < rowIndex[tx + offset + 1] - 1; i++)
				sum += vals[i - setStart] * b[i - setStart];
			// Write to global mem
			d_res[rowStart + offset + tx] = sum;
		}
		// Don't overwrite the shared arrays before everyone is done reading them
		__syncthreads();
	}

	/* Now do the same for what is left in the last, partial set */
	{
		int setStart  = rowIndex[numSets * numLinesProcessedOnce] - 1;
		int setEnd    = rowIndex[numLinesToProcess] - 1;
		int setLength = setEnd - setStart;

		// Read the colIndex, vals and b
		if (tx < setLength)
		{
			colIndex[tx] = d_colIndex[setStart + tx] - 1;
			vals[tx]     = d_vals[setStart + tx];
			b[tx]        = d_b[colIndex[tx]];
		}
		__syncthreads();

		// Compute the product for the remaining rows and write it
		if (tx < numLinesToProcess - numSets * numLinesProcessedOnce)
		{
			float sum = 0;
			int offset = numSets * numLinesProcessedOnce;
			for (int i = rowIndex[tx + offset] - 1; i < rowIndex[tx + offset + 1] - 1; i++)
				sum += vals[i - setStart] * b[i - setStart];
			// Write to global mem
			d_res[rowStart + offset + tx] = sum;
		}
	}
}

Hey guys,

If someone can help with an idea, here is a summary of my problem. I want to do sparse matrix-vector multiplication. I have a number of blocks (n / 255 with my settings), each doing 255 rows.
Since the matrix is packed in CSR format, for each line I need to:

  1. read the start and end of the row
  2. read all values
  3. read the column indices
  4. read the corresponding elements of the vector I am multiplying with
  5. multiply

I am doing this in the following way:

  1. Read the row start and end offsets for all 255 lines processed by the block into shared memory, using all 256 threads.
  2. Since the maximum number of elements in each row is 27 in my case, and I have 256 threads to read with, I divide the lines into sets of 256/27 = 9 lines.
  3. For each set I read, for all lines at once, first the column indices, then the values and the corresponding vector elements; this read uses 9*27 = 243 of the 256 threads.
  4. Then I use 9 threads to do the summations, one for each line in the set, and write the results to global memory.

Doing that I get 1.3 GFlops … can someone help?
I thought of using 512 threads, but too many of them would be unused after the reading.
I also tried fewer threads and got about the same (about 1.4 GFlops for 128 and 1.5 GFlops for 64), so thread count does not seem to be the main bottleneck.
I know I am using a lot of conditionals, but I don’t know how to get rid of them …
Also textures … maybe that might help …
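For the texture idea, a sketch of what fetching the vector through a texture could look like with the legacy texture-reference API (texB and the binding call are my guesses for this kernel, not tested code):

```cuda
/* declared at file scope */
texture<float, 1, cudaReadModeElementType> texB;

/* host side, once before the kernel launch:
   cudaBindTexture(0, texB, d_b, n * sizeof(float)); */

/* in the kernel, the gather from the vector becomes: */
b[tx] = tex1Dfetch(texB, colIndex[tx]);
```

The point is that the scattered reads of the vector would then go through the texture cache instead of hitting uncoalesced global memory directly.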

Any help would be most welcome,

Regards,
Serban

You gave me some ideas for improving my .cu code.

The CSR format depends on three vectors: the value vector, the column index vector, and the row index vector. The last two, the column and row index vectors, are a problem, I think, because every element of q depends on them.

Every time a multiprocessor computes its elements of q, it must load more operands than it would for values stored in a normal array (e.g. A[ ia[i] ]).

thanks for your reply. :)

Also, I had too few threads and too few multiprocessors. I should change the code to process more.