Sparse Matrix-Vector Multiplication on CUDA

dneckels · February 15, 2009, 5:26pm

I ported my GMRES solver over to use this excellent matrix multiply library.

Depending on the problem, the CPU solver is still faster. The most expensive part of GMRES is
the orthogonalization step, with relies on cublas. For small vectors it is slower than cpu blas.

Maybe my card is slow, or maybe there is a threshold size where this is no longer true?

I wrote only for floats, since I do not have a card for debugging doubles.

Anyway, its been a fun project (code attached). If anyone has any bright ideas about speeding things up, let me know.
Using two steps of classical gram-schmidt and sgemm might help, but I think I’m done with this one for a while.
cudaztec.zip (4.06 MB)

shijin · February 26, 2009, 9:14am

I cannot successfully download the .pdf file, would you do me a favor to email me(shijin1984@gmail.com) one?

nbell · February 27, 2009, 10:18am

Sent!

e.ping · April 23, 2009, 12:28am

sjoOof · May 8, 2009, 9:59am

Is the spMV library included in the CUDA SDK version 2.2 that was relesased today? I was unable to locate it.

auhgnist · May 14, 2009, 3:35am

It seems that the ELL format is stored in a column-first way, i.e., the elements in a row are stored at least an ell.stride away. Why you don’t arrange elements in a row to be consecutive in memory? Thanks.

jack · May 14, 2009, 3:37am

Even though I prefer consecutive row elements (row-major format)…a lot of numerical-type languages use column-major format (e.g. FORTRAN), so I’m sure that they were just keeping convention for the majority of people that will be using it. I wouldn’t imagine it would be terribly hard to alter the code to work the other way if you needed it to though.

Tell_Me_Why · May 14, 2009, 10:02pm

Would somebody please email me the attachment of post #41 to me? tellmewhy99 at gmail dot com. Thanks!

auhgnist · May 15, 2009, 1:06pm

Thanks. But the direct implementation of row-wise ELL suffers from performance penalties in my experiments. Most probably due to the accessed memory addresses of consecutive threads are not consecutive actually.

nbell · May 27, 2009, 1:25am

Yep, that’s the reason we use column-major order for ELL (and make sure the vector lengths are multiples of 16). Coalescing works perfectly with column major order, but fails miserably with row-major.

If you absolutely must have row-major order, then I would recommend a one-warp-per-row strategy like the one used by the CSR (vector) kernel. Still, I would expect performance to be worse than the straightforward column major version.

jacques2208 · May 27, 2009, 10:38am

Hi Dominik,

A (possibly different) version of this code should appear in the CUDA 2.2 SDK. However, we’ll release the complete source code in the next week or so here.

Unfortunately, I don’t have any direct comparisons to the CNC work. Blocking is often a very useful optimization as it improves locality and reuse of the “source” vector (i.e. x in y = A*x). This is nearly always a win when the matrix has a natural blocksize, such as a vector-valued problem with regular KxK subblocks. The CNC paper uses blocking even though the underlying problem is scalar-valued. In their application (mesh-based matrices) this proves to be advantageous because the subblocks have a reasonable number of nonzeros due to the ordering of rows and columns. However, there are many problems that do not benefit from blocking, so formats like block-CSR aren’t always beneficial.

We’d definitely like to explore this area though. I think it’s likely that different strategies are needed to handle different block sizes optimally. For example, the best method for 2x2 blocks will probably be different than the best method for 6x6 blocks, since you run into a different set of constraints.

Hi nbell,

since few days the 2.2 SDK is out, but I did not found your SpMV code in it and I wonder when your code will be officially released in a CUDA SDK?

Anyway, thanks for your great job and your answer.

Bests,

Luc Buatois.

nbell · May 27, 2009, 10:00pm

Hi Luc,

I too was under the impression that the SpMV code would appear in the 2.2 SDK, but clearly that did not happen. I am currently working a more complete sparse linear algebra library (currently named “Cusp”) that I will make available soon. Like the Thrust library that we just recently released ( http://forums.nvidia.com/index.php?showtopic=97926 ) Cusp will be open source software.

I will inquire about if/when SpMV will appear in the SDK. In the meantime, I would suggest using the code provided in my initial post or Cusp instead of depending on the SDK.

keitat · June 1, 2009, 8:00pm

Has anybody checked the paper by IBM T. J. Watson? It appears their CSR SpMV is much better than nVIDIA’s version and very competitive with the hybrid storage format. I am wondering if IBM is going to release the code.

ttp://domino.watson.ibm.com/library/CyberDig.nsf/1e4115aea78b6e7c85256b360066f0d4/1d32f6d23b99f7898525752200618339?OpenDocument

keitat

nbell · June 1, 2009, 11:58pm

Hi keitat,

I believe the code is available here:

http://www.alphaworks.ibm.com/tech/spmv4gpu

Note that the fastest IBM method pads all matrix rows to a multiple of 16 for alignment, so it’s actually a modified CSR structure. The paper has some good ideas in it like using 16 threads per row (as opposed to a whole warp) and special handling of the first portion of each row when padding is not used.

marcof · June 18, 2009, 10:48am

Hi, I am having some troubles using the CSR kernels into my CG solver: the results are ok at the first iteration, but they become wrong afterwards.

The scalar version of the kernel runs fine if I replace

ValueType sum = y[row];

by

ValueType sum = 0.0f;

The vector version gives me wrong results. When running into emulation mode, I get the following error: “incorrect use of __syncthreads()”. Indeed, thread_lane sometimes equals 16. Then the thread is not synchronized.

I am pretty new to CUDA, and surely missed something, but what? It looks like the call of spmv from the prior iteration interfers with the current one, but I don’t understand why. Do you have any idea?

Thanks,

marcof.

nbell · June 18, 2009, 4:03pm

[quote name=‘marcof’ post=‘554063’ date=‘Jun 18 2009, 06:48 AM’]

Hi, I am having some troubles using the CSR kernels into my CG solver: the results are ok at the first iteration, but they become wrong afterwards.

The scalar version of the kernel runs fine if I replace

[codebox]template

global void

spmv_csr_vector_kernel(const IndexType num_rows,

                   const IndexType * Ap, 

                   const IndexType * Aj, 

                   const ValueType * Ax, 

                   const ValueType * x, 

                         ValueType * y)

{

#ifdef DEVICE_EMULATION

const IndexType thread_id   = BLOCK_SIZE * blockIdx.x + threadIdx.x;  // global thread index

const IndexType thread_lane = threadIdx.x & (WARP_SIZE-1);            // thread index within the warp

const IndexType warp_id     = thread_id   / WARP_SIZE;                // global warp index

const IndexType num_warps   = (BLOCK_SIZE / WARP_SIZE) * gridDim.x;   // total number of active warps

for(IndexType row = warp_id; row < num_rows; row += num_warps){

    // cheat and use one thread to process the whole row

    if (thread_lane == 0){

        ValueType sum = y[row];

const IndexType row_start = Ap[row];

        const IndexType row_end   = Ap[row+1];

for (IndexType jj = row_start; jj < row_end; jj++){

            sum += Ax[jj] * fetch_x<UseCache>(Aj[jj], x);       

        }

y[row] = sum;

    }

}

#else

__shared__ ValueType sdata[BLOCK_SIZE];

__shared__ IndexType ptrs[BLOCK_SIZE/WARP_SIZE][2];

const IndexType thread_id = BLOCK_SIZE * blockIdx.x + threadIdx.x; // global thread index

const IndexType thread_lane = threadIdx.x & (WARP_SIZE-1);            // thread index within the warp

const IndexType warp_id     = thread_id   / WARP_SIZE;                // global warp index

const IndexType warp_lane   = threadIdx.x / WARP_SIZE;                // warp index within the CTA

const IndexType num_warps   = (BLOCK_SIZE / WARP_SIZE) * gridDim.x;   // total number of active warps

for(IndexType row = warp_id; row < num_rows; row += num_warps){

    // use two threads to fetch Ap[row] and Ap[row+1]

    // this is considerably faster than the more straightforward option

    if(thread_lane < 2)

        ptrs[warp_lane][thread_lane] = Ap[row + thread_lane];

    const IndexType row_start = ptrs[warp_lane][0]; //same as: row_start = Ap[row];

    const IndexType row_end   = ptrs[warp_lane][1]; //same as: row_end   = Ap[row+1];

// compute local sum

    sdata[threadIdx.x] = 0;

    for(IndexType jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)

        sdata[threadIdx.x] += Ax[jj] * fetch_x<UseCache>(Aj[jj], x);

// reduce local sums to row sum (ASSUME: warpsize 32)

    if (thread_lane < 16) { sdata[threadIdx.x] += sdata[threadIdx.x + 16]; }

    if (thread_lane <  8) { sdata[threadIdx.x] += sdata[threadIdx.x +  8]; }

    if (thread_lane <  4) { sdata[threadIdx.x] += sdata[threadIdx.x +  4]; }

    if (thread_lane <  2) { sdata[threadIdx.x] += sdata[threadIdx.x +  2]; }

    if (thread_lane <  1) { sdata[threadIdx.x] += sdata[threadIdx.x +  1]; }

    // first thread writes warp result

    if (thread_lane == 0)

        y[row] += sdata[threadIdx.x];

}

#endif

}[/codebox]

Tell_Me_Why · June 20, 2009, 7:58am

Hi all,

Has anyone tried to write an efficient sparse matrix transpose vector multiply kernel using CSR format? Any code / paper / advice would be great!

Tell_Me_Why · June 25, 2009, 7:18am

Any idea? :unsure:

nbell · June 25, 2009, 2:20pm

The difficulty with this operation is that multiple threads will need to update the same element of the output vector. You could do it with atomics, but in my experience that can be fairly slow.

I can think of two approaches. The first is to just compute the transpose explicitly and use a standard CSR kernel. The transpose could be computed by sorting the matrix entries by column index.

The second approach would attempt to limit the # of atomic operations to global memory. If we assume that most nonzero entries are near the diagonal of the matrix, then we can accumulate a range of row sums in shared memory first and then send those to global memory. Specifically, assume we have a matrix with 512 rows and columns and two blocks of 256 threads. Block 0 will use shared memory (and shared memory atomics) to store results for rows 0 to 255 while Block 1 will do the same for rows 256 to 511. So if a Block 0 encounters a matrix entry that does lie in the range [0, 255] then it uses an atomic operation on shared memory. If a Block 0 encounters a matrix entry that does not lie in the range [0, 255] then it uses an atomic operation on global memory. Once Block 0 has processed all of its entries, it updates global memory with the contents of shared memory, using global atomics. If we assume that a high percentage of nonzeros live in the diagonal blocks (e.g. 0 <= i < 256 and 0 <= j < 256) then this scheme would dramatically reduce the # of global memory atomics.

I haven’t tested this approach myself, but I’ve heard anecdotal evidence that it’s reasonably efficient.

Tell_Me_Why · July 1, 2009, 12:08pm

Thanks for the idea!

Topic		Replies	Views
Sparse Matrix-Matrix Multiplication CUDA Programming and Performance	14	18374	June 15, 2010
Problems allocating sparse Matrix CUDA Programming and Performance	3	2551	March 15, 2010
Using spmv with large sparse matrix CUDA Programming and Performance	11	29691	March 30, 2011
slow performance cusparse spmv CUDA Programming and Performance	14	3221	December 9, 2013
Fast gemv() CUDA Programming and Performance	40	49742	March 22, 2010
Any ideas on GF(2) sparse matrix-vector multiply? CUDA Programming and Performance	5	5993	August 7, 2008
Matrix Partitioning how would you implement it? Sparse matrix partitioning CUDA Programming and Performance	16	14907	March 30, 2010
SpMV implementation problem CUDA Programming and Performance	3	4226	April 17, 2009
Help Improving Performance using cuSolver/cuSparse Routines GPU-Accelerated Libraries cuda , nsight , performance , python , pycuda	0	765	December 15, 2023
CUSPARSE conversion routines not working... cusparseSnnz and cusparseSdense2csr misbehaving... CUDA Programming and Performance	11	4297	February 28, 2011

Sparse Matrix-Vector Multiplication on CUDA

Related topics