Sparse Matrix-Vector Multiplication on CUDA

I ported my GMRES solver over to use this excellent matrix multiply library.

Depending on the problem, the CPU solver is still faster. The most expensive part of GMRES is
the orthogonalization step, with relies on cublas. For small vectors it is slower than cpu blas.

Maybe my card is slow, or maybe there is a threshold size where this is no longer true?

I wrote only for floats, since I do not have a card for debugging doubles.

Anyway, its been a fun project (code attached). If anyone has any bright ideas about speeding things up, let me know.
Using two steps of classical gram-schmidt and sgemm might help, but I think I’m done with this one for a while.
cudaztec.zip (4.06 MB)

I cannot successfully download the .pdf file, would you do me a favor to email me(shijin1984@gmail.com) one?

Sent!

Is the spMV library included in the CUDA SDK version 2.2 that was relesased today? I was unable to locate it.

It seems that the ELL format is stored in a column-first way, i.e., the elements in a row are stored at least an ell.stride away. Why you don’t arrange elements in a row to be consecutive in memory? Thanks.

Even though I prefer consecutive row elements (row-major format)…a lot of numerical-type languages use column-major format (e.g. FORTRAN), so I’m sure that they were just keeping convention for the majority of people that will be using it. I wouldn’t imagine it would be terribly hard to alter the code to work the other way if you needed it to though.

Would somebody please email me the attachment of post #41 to me? tellmewhy99 at gmail dot com. Thanks!

Thanks. But the direct implementation of row-wise ELL suffers from performance penalties in my experiments. Most probably due to the accessed memory addresses of consecutive threads are not consecutive actually.

Yep, that’s the reason we use column-major order for ELL (and make sure the vector lengths are multiples of 16). Coalescing works perfectly with column major order, but fails miserably with row-major.

If you absolutely must have row-major order, then I would recommend a one-warp-per-row strategy like the one used by the CSR (vector) kernel. Still, I would expect performance to be worse than the straightforward column major version.

Hi nbell,

since few days the 2.2 SDK is out, but I did not found your SpMV code in it and I wonder when your code will be officially released in a CUDA SDK?

Anyway, thanks for your great job and your answer.

Bests,

Luc Buatois.

Hi Luc,

I too was under the impression that the SpMV code would appear in the 2.2 SDK, but clearly that did not happen. I am currently working a more complete sparse linear algebra library (currently named “Cusp”) that I will make available soon. Like the Thrust library that we just recently released ( http://forums.nvidia.com/index.php?showtopic=97926 ) Cusp will be open source software.

I will inquire about if/when SpMV will appear in the SDK. In the meantime, I would suggest using the code provided in my initial post or Cusp instead of depending on the SDK.

Has anybody checked the paper by IBM T. J. Watson? It appears their CSR SpMV is much better than nVIDIA’s version and very competitive with the hybrid storage format. I am wondering if IBM is going to release the code.

ttp://domino.watson.ibm.com/library/CyberDig.nsf/1e4115aea78b6e7c85256b360066f0d4/1d32f6d23b99f7898525752200618339?OpenDocument

keitat

Hi keitat,

I believe the code is available here:

http://www.alphaworks.ibm.com/tech/spmv4gpu

Note that the fastest IBM method pads all matrix rows to a multiple of 16 for alignment, so it’s actually a modified CSR structure. The paper has some good ideas in it like using 16 threads per row (as opposed to a whole warp) and special handling of the first portion of each row when padding is not used.

Hi, I am having some troubles using the CSR kernels into my CG solver: the results are ok at the first iteration, but they become wrong afterwards.

  • The scalar version of the kernel runs fine if I replace
ValueType sum = y[row];

by

ValueType sum = 0.0f;
  • The vector version gives me wrong results. When running into emulation mode, I get the following error: “incorrect use of __syncthreads()”. Indeed, thread_lane sometimes equals 16. Then the thread is not synchronized.

I am pretty new to CUDA, and surely missed something, but what? It looks like the call of spmv from the prior iteration interfers with the current one, but I don’t understand why. Do you have any idea?

Thanks,

marcof.

[quote name=‘marcof’ post=‘554063’ date=‘Jun 18 2009, 06:48 AM’]

Hi, I am having some troubles using the CSR kernels into my CG solver: the results are ok at the first iteration, but they become wrong afterwards.

  • The scalar version of the kernel runs fine if I replace
[codebox]template

global void

spmv_csr_vector_kernel(const IndexType num_rows,

                   const IndexType * Ap, 

                   const IndexType * Aj, 

                   const ValueType * Ax, 

                   const ValueType * x, 

                         ValueType * y)

{

#ifdef DEVICE_EMULATION

const IndexType thread_id   = BLOCK_SIZE * blockIdx.x + threadIdx.x;  // global thread index

const IndexType thread_lane = threadIdx.x & (WARP_SIZE-1);            // thread index within the warp

const IndexType warp_id     = thread_id   / WARP_SIZE;                // global warp index

const IndexType num_warps   = (BLOCK_SIZE / WARP_SIZE) * gridDim.x;   // total number of active warps

for(IndexType row = warp_id; row < num_rows; row += num_warps){

    // cheat and use one thread to process the whole row

    if (thread_lane == 0){

        ValueType sum = y[row];

const IndexType row_start = Ap[row];

        const IndexType row_end   = Ap[row+1];

for (IndexType jj = row_start; jj < row_end; jj++){

            sum += Ax[jj] * fetch_x<UseCache>(Aj[jj], x);       

        }

y[row] = sum;

    }

}

#else

__shared__ ValueType sdata[BLOCK_SIZE];

__shared__ IndexType ptrs[BLOCK_SIZE/WARP_SIZE][2];

const IndexType thread_id = BLOCK_SIZE * blockIdx.x + threadIdx.x; // global thread index

const IndexType thread_lane = threadIdx.x & (WARP_SIZE-1);            // thread index within the warp

const IndexType warp_id     = thread_id   / WARP_SIZE;                // global warp index

const IndexType warp_lane   = threadIdx.x / WARP_SIZE;                // warp index within the CTA

const IndexType num_warps   = (BLOCK_SIZE / WARP_SIZE) * gridDim.x;   // total number of active warps

for(IndexType row = warp_id; row < num_rows; row += num_warps){

    // use two threads to fetch Ap[row] and Ap[row+1]

    // this is considerably faster than the more straightforward option

    if(thread_lane < 2)

        ptrs[warp_lane][thread_lane] = Ap[row + thread_lane];

    const IndexType row_start = ptrs[warp_lane][0]; //same as: row_start = Ap[row];

    const IndexType row_end   = ptrs[warp_lane][1]; //same as: row_end   = Ap[row+1];

// compute local sum

    sdata[threadIdx.x] = 0;

    for(IndexType jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)

        sdata[threadIdx.x] += Ax[jj] * fetch_x<UseCache>(Aj[jj], x);

// reduce local sums to row sum (ASSUME: warpsize 32)

    if (thread_lane < 16) { sdata[threadIdx.x] += sdata[threadIdx.x + 16]; }

    if (thread_lane <  8) { sdata[threadIdx.x] += sdata[threadIdx.x +  8]; }

    if (thread_lane <  4) { sdata[threadIdx.x] += sdata[threadIdx.x +  4]; }

    if (thread_lane <  2) { sdata[threadIdx.x] += sdata[threadIdx.x +  2]; }

    if (thread_lane <  1) { sdata[threadIdx.x] += sdata[threadIdx.x +  1]; }

    // first thread writes warp result

    if (thread_lane == 0)

        y[row] += sdata[threadIdx.x];

}

#endif

}[/codebox]

Hi all,

Has anyone tried to write an efficient sparse matrix transpose vector multiply kernel using CSR format? Any code / paper / advice would be great!

Any idea? :unsure:

The difficulty with this operation is that multiple threads will need to update the same element of the output vector. You could do it with atomics, but in my experience that can be fairly slow.

I can think of two approaches. The first is to just compute the transpose explicitly and use a standard CSR kernel. The transpose could be computed by sorting the matrix entries by column index.

The second approach would attempt to limit the # of atomic operations to global memory. If we assume that most nonzero entries are near the diagonal of the matrix, then we can accumulate a range of row sums in shared memory first and then send those to global memory. Specifically, assume we have a matrix with 512 rows and columns and two blocks of 256 threads. Block 0 will use shared memory (and shared memory atomics) to store results for rows 0 to 255 while Block 1 will do the same for rows 256 to 511. So if a Block 0 encounters a matrix entry that does lie in the range [0, 255] then it uses an atomic operation on shared memory. If a Block 0 encounters a matrix entry that does not lie in the range [0, 255] then it uses an atomic operation on global memory. Once Block 0 has processed all of its entries, it updates global memory with the contents of shared memory, using global atomics. If we assume that a high percentage of nonzeros live in the diagonal blocks (e.g. 0 <= i < 256 and 0 <= j < 256) then this scheme would dramatically reduce the # of global memory atomics.

I haven’t tested this approach myself, but I’ve heard anecdotal evidence that it’s reasonably efficient.

Thanks for the idea!