Hi all. I wrote a sparse matrix * vector kernel that I’m sharing with the community. I developed this way back last February for D3D11 CS. It’s been sitting on my hard drive ever since. Bought myself a GTX 570 for Christmas and decided to learn CUDA and port this code.
You can download it from GitHub:
It’s BSD licensed. This is /development/ code so the interface will not be convenient for everyone. I made RAII wrappers for the CUDA driver API and used those in my code - I will be providing a more standard C interface using opaque handles within a week.
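For readers curious what the wrapper approach looks like, here is a minimal sketch of the RAII pattern in isolation. This is not code from the library: driverAlloc/driverFree are stand-ins for real driver calls (the actual wrappers would guard things like cuMemAlloc/cuMemFree), kept host-only so the sketch compiles without CUDA.

```cpp
#include <cassert>
#include <utility>

// Stand-ins for driver-API calls (the real code would wrap e.g.
// cuMemAlloc/cuMemFree); host-only so the sketch is self-contained.
using Resource = int*;
inline Resource driverAlloc(int n) { return new int[n]; }
inline void driverFree(Resource r) { delete[] r; }

// Move-only RAII handle: the resource is released exactly once, in the
// destructor, no matter how the scope is exited.
class DeviceMem {
    Resource res_ = nullptr;
public:
    explicit DeviceMem(int n) : res_(driverAlloc(n)) {}
    ~DeviceMem() { if (res_) driverFree(res_); }
    DeviceMem(DeviceMem&& o) noexcept : res_(std::exchange(o.res_, nullptr)) {}
    DeviceMem& operator=(DeviceMem&& o) noexcept {
        if (this != &o) {
            if (res_) driverFree(res_);
            res_ = std::exchange(o.res_, nullptr);
        }
        return *this;
    }
    DeviceMem(const DeviceMem&) = delete;            // owning, not shareable
    DeviceMem& operator=(const DeviceMem&) = delete;
    Resource get() const { return res_; }
};
```

The planned C interface will hide this behind opaque handles, but inside the library the move-only handle keeps every error path leak-free.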
I do have some questions about further optimization, but before that a little about the algorithm. It performs very well because of five basic optimizations:
- All global memory loads are coalesced.
- Most of the sum is computed with an efficient sequential segmented scan as opposed to an inefficient parallel segmented scan.
- There is no branching in the kernel when computing the sum.
- Segmented scan flags and offsets are baked into colIndices to accelerate the inner loops.
- Low register and shared memory usage delivers high MP occupancy.
The readme.txt in the repository has a complete description of the algorithm and walks you through an 8-thread warp example. The algorithm isn’t super complicated, and the readme includes a nice little primer on segmented scan. If you are getting into GPGPU programming, even if you don’t care about sparse matrices or HPC (that would put us in the same camp), I think this is an illustrative algorithm for learning to negotiate between work-efficient sequential scans and work-inefficient but necessary parallel ones. By sequential operations I mean operations that still run on all threads of the MP, but run sequentially without requiring inter-thread communication.
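To make "sequential segmented scan" concrete, here's a tiny CPU analogue (illustrative only, not code from the library): each thread owns a short run of value/flag pairs, where a set head flag marks the start of a new row segment. The run can be reduced with a plain loop and no inter-thread communication; only the partial sums left at run boundaries need the parallel scan machinery.

```cpp
#include <cassert>
#include <vector>

// A head flag marks the first element of a new segment (matrix row).
struct Pair { bool headFlag; float value; };

// Sequential phase of segmented scan: reduce one thread's run of pairs.
// The first emitted partial belongs to a segment begun by an earlier
// thread; the last partial is a carry-out for later threads to merge.
std::vector<float> sequentialSegmentedSums(const std::vector<Pair>& run) {
    std::vector<float> partials;
    float acc = 0.0f;
    for (const Pair& p : run) {
        if (p.headFlag) {            // a new segment starts here
            partials.push_back(acc); // emit the finished partial sum
            acc = 0.0f;
        }
        acc += p.value;
    }
    partials.push_back(acc);         // tail partial, merged across threads
    return partials;
}
```

In the real kernel this loop is unrolled over VALUES_PER_THREAD elements, which is why raising the per-thread workload raises the sequential-to-parallel work ratio.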
I’ve benchmarked the code against CUSPARSE using NVIDIA’s standard benchmarking matrices. I’ve run three tests: a double-precision test, a single-precision test, and a mixed-precision version that stores floats but converts to double for all the arithmetic:
SeanSparse (double) vs CUSPARSE (double)
Wind Tunnel: 2.01992x
SeanSparse (float-double) vs CUSPARSE (float)
Wind Tunnel: 2.18713x
SeanSparse (float) vs CUSPARSE (float)
Wind Tunnel: 2.67110x
timings.txt in the distribution has the full dump of benchmark timings. You’ll notice a number of different VALUES_PER_THREAD settings in my code - this sets the workload of each thread. Work is executed within warps, with each thread loading VALUES_PER_THREAD value/colIndex pairs from global memory. VALUES_PER_THREAD must be at least four, as there are four indices baked into the column indices required by each thread.

I was anticipating that for denser matrices (where the mean number of non-zero elements per row > VALUES_PER_THREAD), increasing VALUES_PER_THREAD would lead to asymptotically improving performance, since it raises the ratio of sequential operations to parallel ones (the number of parallel operations is the same regardless of work density) without increasing either register or shared memory usage. However, this is not the case: for double precision, VALUES_PER_THREAD=12 gives the best performance, and for single precision, VALUES_PER_THREAD=16 does. I don’t have an explanation for this yet. Because all the loops are unrolled, maybe a larger workload causes instruction cache misses? That seems unlikely, unless the ptxas -v register count is unreliable and the larger workloads are actually using more registers, and therefore getting worse MP occupancy. This is a curiosity I should resolve before giving the library a 1.0 version number.
The other optimization question has to do with the mixed float-double mode. As you can see from the benchmark summary, my mixed-precision kernel is considerably slower than the float kernel. I know that GeForce has crippled DP throughput, but I was really expecting the reduction to be hidden by the global loads. Is there perhaps some penalty for switching between single- and double-precision ops?
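For clarity, the per-element work in the mixed mode is just: load floats, widen, multiply-accumulate in double. Here's a CPU sketch of that idea (names are illustrative, not the kernel's):

```cpp
#include <cassert>
#include <cstdint>

// Mixed-precision dot-product step: values and x are stored as 32-bit
// floats (halving global load traffic vs double), but every product and
// the running sum are computed in double to limit rounding error.
double accumulateMixed(const float* values, const uint32_t* cols,
                       const float* x, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += static_cast<double>(values[i]) * static_cast<double>(x[cols[i]]);
    return sum;
}
```

The memory traffic matches the float kernel exactly; only the arithmetic is double, which is why I expected the loads to hide the extra cost.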
I’d like to discuss optimization with anyone who cares to. I think there is still some headroom here, especially with denser matrices.
If you want to try out the code, the benchmark file demonstrates usage. If you want to integrate it with a large existing app, you should still check out the kernel and matrix encoding routines (in /spmxv/) but wait till I post code with an interface that takes CUDA’s opaque handles instead of my wrappers. I’ll also be writing a faster, streaming version of the matrix preparation code. This library uses a format similar to CSR, but with some flags and offsets baked into the high bits of the column indices.
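To illustrate what "baked into the high bits" means, here's a rough sketch of such an encoding. The bit widths and names below are assumptions for the example; the authoritative layout is in the encoding routines in /spmxv/.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: low 26 bits = column index, bits 26-30 = a small
// offset, bit 31 = segment-head flag. A single load then yields the
// column, the scan flag, and the offset with a few bit ops and no
// extra flag arrays - hence no branching in the inner loop.
constexpr uint32_t kColBits = 26;
constexpr uint32_t kColMask = (1u << kColBits) - 1u;
constexpr uint32_t kFlagBit = 1u << 31;

uint32_t packColIndex(uint32_t col, bool headFlag, uint32_t offset) {
    return (headFlag ? kFlagBit : 0u) | (offset << kColBits) | (col & kColMask);
}
uint32_t colOf(uint32_t p)    { return p & kColMask; }
bool     flagOf(uint32_t p)   { return (p & kFlagBit) != 0u; }
uint32_t offsetOf(uint32_t p) { return (p >> kColBits) & 0x1fu; }
```

The trade-off is a cap on the number of addressable columns (2^26 in this sketch), which is why the packing is done once in the matrix preparation pass rather than in the kernel.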
If you are curious about the D3D11 code from last year, it can be found here: http://www.earthrse.com/sparse/
That code has some bugs and is a mess, but if there is interest I can port the DirectCompute support to the new GitHub code. The same goes for OpenCL.