SpMV and precaching: improving the speed of matrix-vector multiplication

Hi,

I implemented a CG solver. The solver runs well, but the compute time isn't great. So I looked in the profiler and saw that the matrix-vector multiplication takes 94.5% of the whole time, and I'm trying to reduce this. My idea is to precache only those elements of x (in Ax=y) which I need for the computation. But at the moment I have problems getting the elements into the cache (in the right positions). Perhaps somebody has done something similar and can help me. Here is my code:

//Matrix-vector multiplication
__global__ void spmv_jd(const int num_rows,
                        const int* ptr,
                        const int* indices,
                        const computetype* data,
                        const computetype* x,
                        computetype* y)
{
  //Level 2: multiplication in shared memory and parallel reduction
  __shared__ computetype cache_sum[256]; //cache with size of threads per block

  //Level 3: precaching of elements from x (Ax=y)
  __shared__ computetype cache_x[256];   //cache for elements of input vector x

  int idx = blockDim.x * blockIdx.x + threadIdx.x;
  int thread_id = idx + blockIdx.y * gridDim.x * blockDim.x; // global thread index
  int warp_id = thread_id / 64;      // global warp index
  int lane = thread_id & (64 - 1);   // thread index within the warp

  //one warp per row
  int row = warp_id;

  if (row < num_rows)
  {
    int row_start = ptr[row];
    int row_end   = ptr[row + 1];

    //----added-------------------------------------------------
    //here I want to fill the cache only with those elements from x
    //which I need to compute later
    cache_x[threadIdx.x] = 0;
    if (row_start + threadIdx.x < row_end)
      cache_x[threadIdx.x] = x[indices[threadIdx.x]];
    __syncthreads();
    //----added--------------------------------------------------

    //compute running sum per thread
    cache_sum[threadIdx.x] = 0;
    for (int j = row_start + lane; j < row_end; j += 64)
      cache_sum[threadIdx.x] += data[j] * cache_x[j]; //was: cache_sum[threadIdx.x] += data[j] * x[indices[j]];

    __syncthreads();

    //parallel reduction in shared memory
    if (lane < 32) cache_sum[threadIdx.x] += cache_sum[threadIdx.x + 32];
    if (lane < 16) cache_sum[threadIdx.x] += cache_sum[threadIdx.x + 16];
    if (lane <  8) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  8];
    if (lane <  4) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  4];
    if (lane <  2) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  2];
    if (lane <  1) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  1];

    __syncthreads();

    //first thread writes the result
    if (lane == 0)
      y[row] = cache_sum[threadIdx.x];
  }
}

You seem to be using the wrong warp size - it should be 32, but I don’t see that making much difference to the way the code works. What GPU are you running this on? If you are using a Fermi card, I seriously doubt that user managed shared memory buffering will perform better than the hardware L1 and L2 caches. SpMV should be memory bandwidth bound. What sort of performance numbers are you getting in your code?
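For reference, with the warp size fixed at 32 the standard one-warp-per-row CSR kernel looks roughly like this. This is only a sketch reusing your types and parameters (computetype, ptr, indices, data, x, y); it assumes 256 threads per block, and the volatile shared buffer replaces the block-wide syncs for the intra-warp reduction:

// Sketch: one 32-thread warp per row, warp size taken from the hardware.
// Assumes blockDim.x == 256 and a 1D grid; names follow the original post.
__global__ void spmv_csr_vector(const int num_rows,
                                const int* ptr,
                                const int* indices,
                                const computetype* data,
                                const computetype* x,
                                computetype* y)
{
  // volatile so the warp-synchronous reduction reads are not optimized away
  __shared__ volatile computetype cache_sum[256];

  int thread_id = blockDim.x * blockIdx.x + threadIdx.x; // global thread index
  int warp_id   = thread_id / 32;                        // one warp per row
  int lane      = threadIdx.x & 31;                      // index within the warp
  int row       = warp_id;

  if (row < num_rows)
  {
    int row_start = ptr[row];
    int row_end   = ptr[row + 1];

    // running sum per thread, accumulated in a register
    computetype sum = 0;
    for (int j = row_start + lane; j < row_end; j += 32)
      sum += data[j] * x[indices[j]];
    cache_sum[threadIdx.x] = sum;

    // intra-warp reduction; no __syncthreads() needed within one warp
    if (lane < 16) cache_sum[threadIdx.x] += cache_sum[threadIdx.x + 16];
    if (lane <  8) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  8];
    if (lane <  4) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  4];
    if (lane <  2) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  2];
    if (lane <  1) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  1];

    // first thread of the warp writes the result
    if (lane == 0)
      y[row] = cache_sum[threadIdx.x];
  }
}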

The work of Bell and Garland showed that the biggest wins in SpMV performance come from optimizing the sparse matrix format. They maintain an excellent library called CUSP which includes kernels that operate on these optimized sparse formats. You might want to consider trying them out.
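A minimal CUSP-based CG solve looks roughly like this. This is a sketch only; the file name, iteration limit, and tolerance are illustrative, and the exact monitor class name can differ between CUSP versions:

#include <cusp/csr_matrix.h>
#include <cusp/monitor.h>
#include <cusp/krylov/cg.h>
#include <cusp/io/matrix_market.h>

int main(void)
{
  // load the matrix from a Matrix Market file straight into GPU memory
  cusp::csr_matrix<int, double, cusp::device_memory> A;
  cusp::io::read_matrix_market_file(A, "matrix.mtx"); // illustrative file name

  // initial guess and right-hand side
  cusp::array1d<double, cusp::device_memory> x(A.num_rows, 0);
  cusp::array1d<double, cusp::device_memory> b(A.num_rows, 1);

  // stop after 1000 iterations or at a relative residual of 1e-6
  cusp::default_monitor<double> monitor(b, 1000, 1e-6);

  // solve Ax = b with conjugate gradient
  cusp::krylov::cg(A, x, b, monitor);

  return 0;
}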

I started with a warp size of 32, then I tried 64, and 64 seems a little faster than a size of 32. At the moment I am running the code on a Quadro 5000 (Fermi), and for the spmv kernel I get an overall global memory throughput of 4.49694 GB/s.

I tried the CG solver from CUSP a week ago, but unfortunately it is slower than my version. My idea was to minimize jumping around in vector x while computing the sum of each row, because I have a vector x with a length of 6400001 but only need 8-10 elements from it per row. So I thought that caching those elements, by copying them into a cache in parallel and then computing with them, could improve the bandwidth.

I attached the profiler info…

It sounds like the sparsity pattern of your matrix is going to dominate the performance you can achieve on the GPU. The problem I see with the shared memory cache idea is that it won't really give much in the way of memory bandwidth optimization or data re-use unless the threads operating on each matrix row happen to need the vector data that is already in the cache. Otherwise it is a lot of wasted extra work for no benefit, plus shared memory is about 8 times slower than working in registers on Fermi.

It might be that the only optimization you can make is in the area of matrix reordering. It might be possible to model the structure of the matrix as an undirected graph and work out how to reorder the rows so that adjacent rows in the reordered matrix use vector entries that fall within a stride which is useful for the GPU L1/L2 cache or your own buffering scheme; a sketch of applying such a reordering follows below.
I am only a rank amateur when it comes to linear algebra, so I don't know how feasible that would be.
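To make the idea concrete, applying a symmetric permutation to a CSR matrix on the host could look roughly like this. This is a sketch under stated assumptions: perm and all other names are illustrative, and computing a good perm (the graph problem) is the hard part, which this does not do:

#include <vector>

// Sketch: apply a symmetric permutation to a CSR matrix on the host.
// New row i is old row perm[i]; column indices are remapped with the
// inverse permutation so rows and columns are reordered consistently.
void permute_csr(int n, const int* ptr, const int* indices, const double* data,
                 const int* perm, int* new_ptr, int* new_indices, double* new_data)
{
  // inverse permutation: old index -> new index
  std::vector<int> inv(n);
  for (int i = 0; i < n; ++i) inv[perm[i]] = i;

  new_ptr[0] = 0;
  for (int i = 0; i < n; ++i) {
    int old = perm[i];
    int len = ptr[old + 1] - ptr[old];
    new_ptr[i + 1] = new_ptr[i] + len;
    for (int k = 0; k < len; ++k) {
      new_indices[new_ptr[i] + k] = inv[indices[ptr[old] + k]];
      new_data[new_ptr[i] + k]    = data[ptr[old] + k];
    }
  }
}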

Instead of manually managing a cache, you might just use the prefetch PTX instruction of Fermi GPUs to prefetch data into the built-in cache.
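Something along these lines, as a sketch: the helper name is mine, prefetch.global.L2 requires an sm_20 target and a 64-bit build, and p must point to global memory:

// Sketch: hint the hardware to fetch a global address into L2 ahead of use.
__device__ void prefetch_l2(const void* p)
{
  asm("prefetch.global.L2 [%0];" :: "l"(p));
}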

EDIT: Just realized you are not trying to optimize latency, so prefetch is of little use.

Thanks for your reply,

I analyzed my matrix structure and found the bottleneck. My last row has no zero elements, so I had to compute 640001 elements in the spmv kernel. Now I compute only the first N-1 rows in the kernel, and then I compute the dot product of the last row with my input vector using cublasDdot. The results are fine: CG is 3 times faster than before, and now I have a bandwidth of 12.9 GB/s for my spmv kernel.
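In case it helps someone else, handling the dense last row with one dot product can look roughly like this. This is a sketch using the CUBLAS v2 API (the original post may have used the legacy API); variable names are illustrative:

#include <cublas_v2.h>

// Sketch: y[N-1] = dot(last_row, x) for the single dense row,
// computed with CUBLAS instead of the SpMV kernel.
// d_last_row and d_x are device pointers of length n.
double dense_row_dot(cublasHandle_t handle, int n,
                     const double* d_last_row, const double* d_x)
{
  double result = 0.0;
  cublasDdot(handle, n, d_last_row, 1, d_x, 1, &result);
  return result;
}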

Techniques like that are very useful for sparse direct solvers, and there’s a huge literature on reordering techniques for symmetric sparse matrices. For unsymmetric matrices or matrices without any particular format, pretty much the only tool available is graph partitioning; a good partition can concentrate most of the matrix elements into blocks if the matrix is sparse enough.