performance question: sparse matrix-vector multiplication

I used an iterative method to solve a finite element problem.

In the iterative method we have to execute dot, saxpy, and sparse matrix-vector multiplication. The first two, dot and saxpy, have no performance problem, but the sparse matrix multiplication (q = A*p) does.

I configured 32 multiprocessors (each multiprocessor has 16 threads) to compute the elements of q (using more than 32 multiprocessors causes wrong results because of a data dependence).

More detail is included in the attached .txt files.

Can anybody help me? Many thanks.
parallel_code_.cu_.txt (2.17 KB)
sequential_code.txt (713 Bytes)

Hi. I basically have the same problem as you. I have implemented an initial version of the kernel, but as I am new to CUDA I only get 1.3 GFlops. I am working to improve on that, but I do not know exactly how. I probably have bank conflicts, but currently I do not know how to remove them. I am also thinking of using textures, at least for the vector, which should benefit from caching.

I paste the kernel code below. I am assuming the order of the matrix to be a multiple of 255. I chose this number because I am using 256 threads: I want to use them all to read the row indices at the beginning for all the lines processed by the block, and the number of entries needed from the row index vector is the number of lines + 1.

I am probably not the best one to give advice, but regarding your code:

  1. 32 seems too small a number of threads

  2. You are not using shared memory to coalesce your reads; you read directly from the global arrays all the time

I tried to coalesce my reads as much as I could … it hasn’t made much of a difference yet …

Regards,

Serban

#include <stdio.h>

#define MAX_ELEMENTS_PER_LINE 27

/* CSR sparse matrix-vector product: d_res = A * d_b.
   The indices in d_colIndex / d_rowIndex are 1-based (Fortran style),
   hence the "- 1" adjustments below. */
__global__ void spmvx_v1_kernel(const float *d_vals, const int *d_colIndex,
                                const int *d_rowIndex, const float *d_b,
                                float *d_res)
{
	int iSet;

	/* Get various data */
	const int tx = threadIdx.x;
	const int bx = blockIdx.x;
	const int numThreads = blockDim.x;
	const int numLinesToProcess = numThreads - 1;
	const int numLinesProcessedOnce = numThreads / MAX_ELEMENTS_PER_LINE;
	const int numSets = numLinesToProcess / numLinesProcessedOnce;

	/* Shared memory */
	__shared__ int   rowIndex[256];   // Read once for all lines of the block
	__shared__ int   colIndex[256];   // Just for a set of 256/27 = 9 lines
	__shared__ float vals[256];       // Just for a set
	__shared__ float b[256];          // Just for a set

	/* First row processed by the block */
	const int rowStart = numLinesToProcess * bx;

	/* Read the rowIndex data for all rows using all threads */
	rowIndex[tx] = d_rowIndex[rowStart + tx];
	__syncthreads();

	/* Process the full sets of lines (the remainder afterwards) */
	for (iSet = 0; iSet < numSets; iSet++)
	{
		// Locations of this set in the value/column arrays
		int setStart  = rowIndex[iSet * numLinesProcessedOnce] - 1;
		int setEnd    = rowIndex[(iSet + 1) * numLinesProcessedOnce] - 1;
		int setLength = setEnd - setStart;

		// Read the colIndex, vals and b for the whole set
		if (tx < setLength)
		{
			colIndex[tx] = d_colIndex[setStart + tx] - 1;
			vals[tx]     = d_vals[setStart + tx];  // values use the CSR element
			                                       // index, not the column index
			b[tx]        = d_b[colIndex[tx]];      // gather from the vector
		}
		__syncthreads();

		// Compute the product for all rows in the set and write it
		if (tx < numLinesProcessedOnce)
		{
			float sum = 0;
			int offset = iSet * numLinesProcessedOnce;
			for (int i = rowIndex[tx + offset] - 1; i < rowIndex[tx + offset + 1] - 1; i++)
				sum += vals[i - setStart] * b[i - setStart];
			// Write to global mem
			d_res[rowStart + offset + tx] = sum;
		}
		// Don't overwrite the shared arrays before everyone is done reading them
		__syncthreads();
	}

	/* Now do the same for what is left in the last, partial set */
	{
		int setStart  = rowIndex[numSets * numLinesProcessedOnce] - 1;
		int setEnd    = rowIndex[numLinesToProcess] - 1;
		int setLength = setEnd - setStart;

		// Read the colIndex, vals and b
		if (tx < setLength)
		{
			colIndex[tx] = d_colIndex[setStart + tx] - 1;
			vals[tx]     = d_vals[setStart + tx];
			b[tx]        = d_b[colIndex[tx]];
		}
		__syncthreads();

		// Compute the product for the remaining rows and write it
		if (tx < numLinesToProcess - numSets * numLinesProcessedOnce)
		{
			float sum = 0;
			int offset = numSets * numLinesProcessedOnce;
			for (int i = rowIndex[tx + offset] - 1; i < rowIndex[tx + offset + 1] - 1; i++)
				sum += vals[i - setStart] * b[i - setStart];
			// Write to global mem
			d_res[rowStart + offset + tx] = sum;
		}
	}
}

Hey guys,

If someone can help with an idea, here is a summary of my problem. I want to do sparse matrix-vector multiplication. I have a number of blocks (n / 255 with my settings), each doing 255 rows.
Since the matrix is packed in CSR format, for each line I need to:

  1. read the start and end of the row
  2. read all values
  3. read the column indices
  4. read the corresponding elements of the vector I am multiplying with
  5. multiply

I am doing this in the following way:

  1. Read the row start and end offsets for all 255 lines processed by the block into shared memory, using all 256 threads.
  2. Since the maximum number of elements in each row is 27 in my case, and I have 256 threads to read with, I divide the lines into sets of 256/27 = 9 lines.
  3. For each set I read, for all lines at once, first the column indices, then the values and the corresponding vector elements; this read uses 9*27 = 243 of the 256 threads.
  4. Then I use 9 threads to do the summations, one for each line in the set, and write the results to global memory.

Doing that I get 1.3 GFlops … can someone help?
I thought of using 512 threads, but too many of them would be unused after the reading.
I also tried fewer threads and got about the same (about 1.4 GFlops for 128 and 1.5 GFlops for 64), so thread count does not seem to be the main bottleneck.
I know I am using a lot of conditionals, but I don’t know how to get rid of them …
Also textures … maybe that might help …
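For the texture idea, a sketch of what fetching the vector through a texture could look like with the legacy texture-reference API (texB and the binding call are my guesses for this kernel, not tested code):

```cuda
/* declared at file scope */
texture<float, 1, cudaReadModeElementType> texB;

/* host side, once before the kernel launch:
   cudaBindTexture(0, texB, d_b, n * sizeof(float)); */

/* in the kernel, the gather from the vector becomes: */
b[tx] = tex1Dfetch(texB, colIndex[tx]);
```

The point is that the scattered reads of the vector would then go through the texture cache instead of hitting uncoalesced global memory directly.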

Any help would be most welcome,

Regards,
Serban

You gave me some ideas for improving my .cu code.

The CSR format depends on three vectors: the value vector, the column index vector, and the row index vector. The last two, the column and row index vectors, are a problem, I think, because every element of q depends on them.

Every time a multiprocessor computes its elements of q, it must load more operands than it would for values stored in a normal array (e.g. A[ ia[i] ]).

thanks for your reply. :)

Also, I had too few threads and too few multiprocessors. I should change the code to process more.