New Sparse Matrix Project on Github: 1.6-4.0x the performance of CUSPARSE

Hi all. I wrote a sparse matrix * vector kernel that I’m sharing with the community. I developed it way back last February for D3D11 CS, and it’s been sitting on my hard drive ever since. I bought myself a GTX 570 for Christmas and decided to learn CUDA and port this code.

You can download it from Github:
http://www.github.com/seanbaxter/sparse/

It’s BSD licensed. This is /development/ code so the interface will not be convenient for everyone. I made RAII wrappers for the CUDA driver API and used those in my code - I will be providing a more standard C interface using opaque handles within a week.

I do have some questions about further optimization, but first a little about the algorithm. It performs very well because of five basic optimizations:

  •   All global memory loads are coalesced.
  •   Most of the sum is computed with an efficient sequential segmented scan, as opposed to an inefficient parallel segmented scan.
  •   There is no branching in the kernel when computing the sum.
  •   Segmented scan flags and offsets are baked into colIndices to accelerate the inner loops.
  •   Low register and shared memory usage delivers high MP occupancy.

The readme.txt in the repository has a complete description of the algorithm and walks you through an 8-thread warp example. The algorithm isn’t super complicated, and the readme includes a nice little description of segmented scan. If you are getting into GPGPU programming, even if you don’t care about sparse matrices or HPC (that would put us in the same camp), I think this is an illustrative algorithm for learning to negotiate between work-efficient sequential scans and work-inefficient but necessary parallel ones. By sequential operations I mean operations that still run on all threads in the MP, but run sequentially without requiring inter-thread communication.
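To give a rough idea of the sequential part, here is a minimal sketch of a thread-sequential segmented scan. This is not the kernel from the repository - the names (VALUES_PER_THREAD, headFlags, partials) and the separate flag array are just for illustration; the real code bakes the flags into the column indices as described above.

    // Each thread scans the VALUES_PER_THREAD values it owns serially: no
    // inter-thread communication, so this part is work-efficient. Only the
    // carry-out of the last (open) segment has to go through the parallel
    // intra-warp segmented scan afterwards.
    template<int VALUES_PER_THREAD>
    __device__ float SequentialSegScan(const float* values, const bool* headFlags,
                                       float* partials) {
        float sum = 0.0f;
        #pragma unroll
        for(int i = 0; i < VALUES_PER_THREAD; ++i) {
            if(headFlags[i]) sum = 0.0f;     // head flag: a new segment starts here
            sum += values[i];
            partials[i] = sum;               // inclusive scan within this thread
        }
        return sum;                          // carry-out for the parallel scan stage
    }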

I’ve benchmarked the code against CUSPARSE using NVIDIA’s standard benchmarking matrices. I’ve run three tests: a double precision test, a single precision test, and a mixed-precision version that stores floats in memory but converts to double for all the arithmetic:

SeanSparse (double) vs CUSPARSE (double)
Protein: 1.65990x
FEM/Spheres: 1.83037x
FEM/Cantilever: 2.05642x
Wind Tunnel: 2.01992x
FEM/Harbor: 2.43449x
QCD: 2.72919x
FEM/Ship: 1.97966x
Economics: 3.12688x
Epidemiology: 2.10099x
FEM/Accelerator: 2.50494x
Circuit: 2.92598x
Webbase: 1.94037x

SeanSparse (float-double) vs CUSPARSE (float)
Protein: 1.76457x
FEM/Spheres: 2.06819x
FEM/Cantilever: 2.22412x
Wind Tunnel: 2.18713x
FEM/Harbor: 2.64285x
QCD: 3.07786x
FEM/Ship: 2.20049x
Economics: 3.05169x
Epidemiology: 2.19352x
FEM/Accelerator: 2.78746x
Circuit: 3.16430x
Webbase: 2.02053x

SeanSparse (float) vs CUSPARSE (float)
Protein: 2.17766x
FEM/Spheres: 2.54489x
FEM/Cantilever: 2.74985x
Wind Tunnel: 2.67110x
FEM/Harbor: 3.29606x
QCD: 3.82771x
FEM/Ship: 2.68316x
Economics: 4.02870x
Epidemiology: 2.59489x
FEM/Accelerator: 3.10195x
Circuit: 3.84192x
Webbase: 2.45225x

timings.txt in the distribution has the full dump of benchmark timings. You’ll notice a number of different VALUES_PER_THREAD settings in my code; this sets the workload of each thread. Work is executed within warps, with each thread loading VALUES_PER_THREAD value/colIndex pairs from global memory. VALUES_PER_THREAD must be at least four, as each thread requires four indices baked into its column indices.

I was anticipating that for denser matrices (where the mean number of non-zero elements per row exceeds VALUES_PER_THREAD), increasing VALUES_PER_THREAD would lead to asymptotically improving performance, since it raises the ratio of sequential operations to parallel ones (the number of parallel operations stays the same regardless of work density) without increasing either register or shared memory usage. However this is not the case: for double precision VALUES_PER_THREAD=12 gives the best performance, and for single precision VALUES_PER_THREAD=16 does. I don’t have an explanation for this yet. Because all the loops are unrolled, maybe a larger workload causes instruction cache misses? That seems unlikely, unless the ptxas -v register count is unreliable and the larger workloads are actually using more registers, and therefore getting worse MP occupancy. This is a curiosity I should resolve before giving the library a 1.0 version number.
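To make the VALUES_PER_THREAD knob concrete, here is a sketch of the per-thread load loop. This is not the repository’s code - the array names and the warp-interleaved layout shown here are assumptions for illustration; the point is just that each lane issues VALUES_PER_THREAD loads, and consecutive lanes always touch consecutive addresses, so every load is coalesced.

    // Hypothetical warp-interleaved layout: lane l reads elements
    // warpOffset + l, warpOffset + 32 + l, warpOffset + 64 + l, ...
    template<int VALUES_PER_THREAD>
    __device__ void LoadWarpValues(const float* values, const unsigned int* colIndices,
                                   int warpOffset, int lane,
                                   float threadValues[VALUES_PER_THREAD],
                                   unsigned int threadCols[VALUES_PER_THREAD]) {
        #pragma unroll
        for(int i = 0; i < VALUES_PER_THREAD; ++i) {
            int index = warpOffset + 32 * i + lane;  // consecutive lanes hit
            threadValues[i] = values[index];         // consecutive addresses
            threadCols[i] = colIndices[index];
        }
    }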

The other optimization question has to do with the mixed float-double mode. As you can see from the benchmark summary, my mixed-precision kernel is considerably slower than the float kernel. I know that GeForce has crippled DP throughput, but I was really expecting the reduction to be hidden by the global loads. Is there perhaps some penalty for switching between single- and double-precision ops?
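For reference, the mixed mode means exactly what the benchmark description above says: values and the fetched x entries are stored and loaded as float, and everything is promoted to double for the arithmetic. A sketch of the idea (names are illustrative, not the repo’s):

    // Loads stay at float width (half the bandwidth of an all-double kernel),
    // but the accumulation runs entirely in double precision.
    __device__ double MixedPrecisionDot(const float* values, const float* xFetched,
                                        int count) {
        double sum = 0.0;
        for(int i = 0; i < count; ++i)
            sum += (double)values[i] * (double)xFetched[i];  // DP multiply-add
        return sum;
    }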

I’d like to discuss optimization with anyone who cares to. I think there is still some headroom here, especially with denser matrices.

If you want to try out the code, the benchmark file demonstrates usage. If you want to integrate it with a large existing app, you should still check out the kernel and matrix encoding routines (in /spmxv/) but wait till I post code with an interface that takes CUDA’s opaque handles instead of my wrappers. I’ll also be writing a faster, streaming version of the matrix preparation code. This library uses a format similar to CSR, but with some flags and offsets baked into the high bits of the column indices.
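As a rough illustration of the encoding idea (the actual bit layout is documented in readme.txt; the layout below, with one flag in the top bit and the column index in the low bits, is hypothetical):

    // Hypothetical packing: bit 31 carries a segmented-scan flag, the low 31
    // bits carry the column index. The real format packs more than this.
    __host__ __device__ inline unsigned int PackColIndex(unsigned int col, bool flag) {
        return col | (flag ? 0x80000000u : 0u);
    }
    __device__ inline void UnpackColIndex(unsigned int packed, unsigned int& col,
                                          bool& flag) {
        flag = 0 != (packed & 0x80000000u);
        col = packed & 0x7fffffffu;
    }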

If you are curious about the D3D11 code from last year, it can be found here: http://www.earthrse.com/sparse/
That code has some bugs and is a mess, but if there is interest I can rewrite the DirectCompute support against the new github code. Same goes for OpenCL.

.sean