SpMV and precaching: improving the speed of matrix-vector multiplication


I implemented a CG solver. The solver runs correctly, but the compute time is disappointing. I looked in the profiler and saw that the matrix-vector multiplication takes 94.5% of the whole time, so I am trying to reduce this. My idea is to precache only those elements of x (in Ax=y) which I need for the computation. At the moment, however, I have some problems getting the elements into the cache (at the right position). Perhaps somebody has done something similar and can help me. Here is my code:

//Matrix-vector multiplication
__global__ void spmv_jd(const int num_rows,
                        const int* ptr,
                        const int* indices,
                        const computetype* data,
                        const computetype* x,
                        computetype* y)
{
  //Level 2: multiplication in shared memory and parallel reduction
  __shared__ computetype cache_sum[256]; // cache with size of threads per block

  //Level 3: precaching of elements from x (Ax=y)
  __shared__ computetype cache_x[256];   // cache for elements of input vector x

  int idx       = blockDim.x * blockIdx.x + threadIdx.x;
  int thread_id = idx + blockIdx.y * gridDim.x * blockDim.x; // global thread index
  int warp_id   = thread_id / 64;                            // global warp index
  int lane      = thread_id & (64 - 1);                      // thread index within the warp

  // one warp per row
  int row = warp_id;

  if (row < num_rows)
  {
    int row_start = ptr[row];
    int row_end   = ptr[row + 1];

    // here I want to fill the cache only with those elements of x which I need to compute later
    cache_x[threadIdx.x] = 0;
    if (row_start + threadIdx.x < row_end)
      cache_x[threadIdx.x] = x[indices[threadIdx.x]]; // this indexing is what I cannot get right

    // compute running sum per thread
    cache_sum[threadIdx.x] = 0;
    for (int j = row_start + lane; j < row_end; j += 64)
      cache_sum[threadIdx.x] += data[j] * cache_x[j]; // was: cache_sum[threadIdx.x] += data[j] * x[indices[j]];

    // parallel reduction in shared memory
    if (lane < 32) cache_sum[threadIdx.x] += cache_sum[threadIdx.x + 32];
    if (lane < 16) cache_sum[threadIdx.x] += cache_sum[threadIdx.x + 16];
    if (lane <  8) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  8];
    if (lane <  4) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  4];
    if (lane <  2) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  2];
    if (lane <  1) cache_sum[threadIdx.x] += cache_sum[threadIdx.x +  1];

    // first thread writes the result
    if (lane == 0)
      y[row] = cache_sum[threadIdx.x];
  }
}
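While debugging the caching scheme, it can help to have a plain host-side CSR reference to check the kernel's output against. A minimal sketch (the helper name and the choice of double for computetype are my assumptions, not part of the original code):

```cpp
#include <vector>

typedef double computetype; // assumption for this sketch

// Host-side CSR reference SpMV: y = A*x, for verifying GPU kernel results.
void spmv_csr_host(int num_rows,
                   const std::vector<int>& ptr,              // row offsets, size num_rows+1
                   const std::vector<int>& indices,          // column index of each nonzero
                   const std::vector<computetype>& data,     // nonzero values
                   const std::vector<computetype>& x,
                   std::vector<computetype>& y)
{
    for (int row = 0; row < num_rows; ++row) {
        computetype sum = 0;
        for (int j = ptr[row]; j < ptr[row + 1]; ++j)
            sum += data[j] * x[indices[j]];  // gather from x via column indices
        y[row] = sum;
    }
}
```

Comparing y from this routine against the kernel's y on a small matrix makes it easy to see whether the cache_x indexing is fetching the right entries.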



You seem to be using the wrong warp size - it should be 32, but I don’t see that making much difference to the way the code works. What GPU are you running this on? If you are using a Fermi card, I seriously doubt that user managed shared memory buffering will perform better than the hardware L1 and L2 caches. SpMV should be memory bandwidth bound. What sort of performance numbers are you getting in your code?

The work of Bell and Garland showed that the biggest wins in SpMV performance come from optimizing the sparse storage format. They maintain an excellent library called CUSP which includes kernels that operate on these optimized sparse formats. You might want to consider trying them out.

I started with a warp size of 32, then I tried 64, and 64 seems a little bit faster than a size of 32. At the moment I am running the code on a Quadro 5000 (Fermi), and for the spmv kernel I get an overall global memory throughput of 4.49694 GB/s.

I tried the CG solver from CUSP a week ago, but unfortunately it is slower than my version. My idea was to minimize jumping around in vector x while computing the sum of each row, because I have a vector x of length 6400001 but only need 8-10 elements of it per row. So I thought that caching those elements, by copying them into a cache in parallel and then computing with them, could improve the bandwidth.
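One way to check whether such a cache can pay off is to measure how far apart the 8-10 referenced entries of x actually are in each row. A hypothetical host-side sketch (function name is mine):

```cpp
#include <vector>
#include <algorithm>

// For each CSR row, compute the spread (max - min) of the column indices it
// touches, and return the worst case. A small spread means the needed x
// entries are close together and could fit in one shared-memory tile; a
// large spread means a per-block cache of x cannot cover them.
int max_row_column_spread(int num_rows,
                          const std::vector<int>& ptr,
                          const std::vector<int>& indices)
{
    int worst = 0;
    for (int row = 0; row < num_rows; ++row) {
        int begin = ptr[row], end = ptr[row + 1];
        if (end - begin < 2) continue;  // 0 or 1 nonzeros: spread is 0
        int lo = *std::min_element(indices.begin() + begin, indices.begin() + end);
        int hi = *std::max_element(indices.begin() + begin, indices.begin() + end);
        worst = std::max(worst, hi - lo);
    }
    return worst;
}
```

If the spread is much larger than the 256-element cache, the precaching idea cannot work without reordering the matrix first.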

I attached the profiler info…

It sounds like the sparsity pattern of your matrix is going to dominate the performance you can achieve on the GPU. The problem I see with the shared memory cache idea is that it won't really give much in the way of memory bandwidth optimization or data re-use unless the threads operating on each matrix row happen to need the vector data that is in the cache. Otherwise it is a lot of wasted extra work for no benefit; plus, shared memory is about 8 times slower than working in registers on Fermi.

It might be that the only optimization you can make is in the area of matrix reordering. It might be possible to use an undirected graph to model the structure of the matrix and work out how to reorder the rows so that adjacent rows in the reordered matrix use vector entries which fall within a stride that is useful for the GPU L1/L2 cache or your own buffering system. I am only a rank amateur when it comes to linear algebra, so I don't know how feasible that would be.
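As an illustration of that reordering idea, a minimal Cuthill-McKee style pass over a symmetric sparsity pattern might look like the following. This is only a sketch under my own assumptions (CSR adjacency input, simple start-node choice); production codes pick pseudo-peripheral start nodes and reverse the result (RCM):

```cpp
#include <vector>
#include <queue>
#include <algorithm>

// Minimal Cuthill-McKee ordering of a symmetric sparsity pattern given in
// CSR form (ptr/indices describe the adjacency of each row). BFS visits
// neighbours lowest-degree first, which tends to cluster connected rows
// and reduce the bandwidth of the reordered matrix.
std::vector<int> cuthill_mckee(int n,
                               const std::vector<int>& ptr,
                               const std::vector<int>& indices)
{
    std::vector<int> order;
    std::vector<bool> visited(n, false);
    auto degree = [&](int v) { return ptr[v + 1] - ptr[v]; };

    for (int start = 0; start < n; ++start) {   // handles disconnected graphs
        if (visited[start]) continue;
        std::queue<int> q;
        q.push(start);
        visited[start] = true;
        while (!q.empty()) {
            int v = q.front(); q.pop();
            order.push_back(v);
            // gather unvisited neighbours, lowest degree first
            std::vector<int> nbrs;
            for (int j = ptr[v]; j < ptr[v + 1]; ++j)
                if (!visited[indices[j]]) {
                    visited[indices[j]] = true;
                    nbrs.push_back(indices[j]);
                }
            std::sort(nbrs.begin(), nbrs.end(),
                      [&](int a, int b) { return degree(a) < degree(b); });
            for (int u : nbrs) q.push(u);
        }
    }
    return order;  // new row order; apply the same permutation to columns
}
```

After permuting rows and columns with the returned order, nearby rows reference nearby entries of x, which is exactly what a cache (hardware or hand-managed) needs.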

Instead of manually managing a cache, you might just use the prefetch PTX instruction of Fermi GPUs to prefetch data into the built-in cache.

EDIT: Just realized you are not trying to optimize latency, so prefetch is of little use.

Thanks for your reply,

I analyzed my matrix structure and found the bottleneck. My last row has no zero elements, so I had to compute 640001 elements in the spmv kernel. Now I compute only the first N-1 rows there, and then I compute the last row against my input vector with cublasDdot. The results are fine: the CG is 3 times faster than before, and now I have a bandwidth of 12.9 GB/s for my spmv kernel.
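For readers wanting to try the same trick, the split might look like this on the host side. This is a sketch, not the poster's actual code: std::inner_product stands in for cublasDdot, and all names are hypothetical:

```cpp
#include <vector>
#include <numeric>

typedef double computetype; // assumption for this sketch

// Split SpMV: rows 0..N-2 go through the sparse CSR path, while the fully
// dense last row is handled as one plain dot product (on the GPU this is
// where cublasDdot would be called instead).
void spmv_split_dense_last_row(int num_rows,
                               const std::vector<int>& ptr,      // offsets for the N-1 sparse rows
                               const std::vector<int>& indices,
                               const std::vector<computetype>& data,
                               const std::vector<computetype>& dense_last_row,
                               const std::vector<computetype>& x,
                               std::vector<computetype>& y)
{
    // sparse part: all rows except the last
    for (int row = 0; row < num_rows - 1; ++row) {
        computetype sum = 0;
        for (int j = ptr[row]; j < ptr[row + 1]; ++j)
            sum += data[j] * x[indices[j]];
        y[row] = sum;
    }
    // dense part: one dot product for the last row
    y[num_rows - 1] = std::inner_product(dense_last_row.begin(),
                                         dense_last_row.end(),
                                         x.begin(), computetype(0));
}
```

Keeping the dense row out of the CSR kernel removes the one row whose single warp had to serially chew through all 640001 nonzeros, which is what limited the kernel before.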

Techniques like that are very useful for sparse direct solvers, and there’s a huge literature on reordering techniques for symmetric sparse matrices. For unsymmetric matrices or matrices without any particular format, pretty much the only tool available is graph partitioning; a good partition can concentrate most of the matrix elements into blocks if the matrix is sparse enough.