Any ideas on GF(2) sparse matrix-vector multiply?

I’ve been toying with CUDA-ifying a few different nullspace-vector-finding algorithms (Montgomery’s Block Lanczos and Coppersmith’s Block Wiedemann) to add to my number field sieve implementation.

The expensive portion is multiplying A by x to yield b, and the transpose of A by x to yield bt. A is a sparse matrix, x is a dense vector, and b and bt are the resultant vectors.

The primary matrix I’ve been testing on is a 37875 x 37986 matrix with a total weight of 1012491 and an average weight of 26.65 per column (very sparse). It seems I’ve no choice but to store it in row offset format.

The first method, which oddly produces the fastest results on my GTX 280 yet still runs at only ~75% of the speed of a single 3.2 GHz P4 core, is as follows:

Determine the column index of the thread, then do a coalesced load of 4 uints containing row offsets, or (uint)-1 for null. For each member that is not (uint)-1, XOR (multiplication in GF(2)) the appropriate element of x into b using atomicXor. Loads are thus coalesced, but there are tons of bank conflicts and the stores are random.
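For reference, the arithmetic itself is just XOR accumulation. Here is a minimal CPU sketch of what that method computes, assuming the same padded column-offset layout ((uint)-1 as the null marker); the function name and the `pad` width are my own, not from the original code:

```c
#include <stdint.h>

#define NONE ((uint32_t)-1)

/* GF(2) SpMV reference: b[row] ^= x[col] for every nonzero A[row][col].
 * The matrix is stored per column as a fixed-width list of row offsets,
 * padded with (uint32_t)-1, mirroring the coalesced device layout.
 * Each word of x/b carries 32 parallel bit-vectors, as in block methods. */
static void spmv_gf2(const uint32_t *rows, int ncols, int pad,
                     const uint32_t *x, uint32_t *b)
{
    for (int c = 0; c < ncols; c++)
        for (int i = 0; i < pad; i++) {
            uint32_t r = rows[c * pad + i];
            if (r != NONE)
                b[r] ^= x[c];   /* multiply-accumulate in GF(2) is XOR */
        }
}
```

The GPU version distributes the outer loop over threads and replaces the `^=` with atomicXor, which is where the random stores come from.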

The second method consists of storing the row offsets as follows:


and each “block” of columns is as long as the heaviest column. Loads do not coalesce, although they should (each load is a uint and __syncthreads() is used). Memory is allocated using cudaMallocPitch.

Performance is roughly 8% of the single 3.2 GHz P4 core.

Anyone have any ideas?

PS - If anyone is interested I’ve implemented the first method in a standalone .cu file for testing.
PPS - SGEMM with matrices like these is horrifically slow. Already tried it, plus I’m trying to cram as much in the device as possible, SGEMM eats MxNx2 bits of storage.

I’m pretty sure my solution won’t work for you since your matrix sizes are huge, but I can tell you what I did to solve the same sparse-matrix, dense-vector problem. Perhaps you can take a couple of insights from it and work out another approach by blocking things further, etc.

We’re trying to compute y = Ax.

I reduced my sparse matrix down into a form with very few zeros, so that if A was normally M x N, it turned into A' of size M x (N * pweight), where pweight is some percentage of the total N (usually quite small).

I created another matrix I of the same dimensions as A' that holds the index into x that A would normally use if I hadn’t reduced it.
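This is essentially an ELL-style compaction. A quick CPU sketch of building A' and I from a dense A, assuming a fixed per-row nonzero bound K (the function name and padding convention are mine):

```c
/* ELL-style compaction: squeeze a dense M x N matrix A into Ap (values)
 * and I (column indices), each M x K, where K bounds the nonzeros per
 * row. y = A*x then becomes y[r] = sum_k Ap[r*K+k] * x[I[r*K+k]]. */
static void compact(const float *A, int M, int N, int K, float *Ap, int *I)
{
    for (int r = 0; r < M; r++) {
        int k = 0;
        for (int c = 0; c < N && k < K; c++)
            if (A[r * N + c] != 0.0f) {
                Ap[r * K + k] = A[r * N + c];
                I[r * K + k] = c;
                k++;
            }
        for (; k < K; k++) {
            Ap[r * K + k] = 0.0f;   /* padding: contributes 0 * x[0] */
            I[r * K + k] = 0;
        }
    }
}
```

Padding entries are zero values pointing at x[0], so they never alter the dot products.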

Then I loaded x in its entirety into shared memory (this limits N to somewhere around 2000 - 3000, I think, so it’s not a great solution). A' and I are represented as textures.

I then just chopped up the computation (I tried both one thread per dot product and a block-style computation; surprisingly, one-thread-per-dot-product was faster) and saved the results into global memory.

It’s a 6-8x speedup over the CPU (3.2 GHz), and I’m pretty sure I could do better if I optimized (which is hard since I can’t seem to get the profiling tools to work as advertised).

Anyways, here’s the code, maybe it’ll help. Since the block code is cleaner, I’ll paste that.

Kind Regards,


__global__ void
sparse_matrix_vec_mul_block_kernel( float* y, float* A, int * I, float* x, int nrA, int ncA, int nx, float r0)
{
    // Block index
    int bx = blockIdx.x;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    /* row idx of starting row. */
    int ridx = blockDim.y * bx + ty;

    __shared__ float xs[MAX_ROWS];
    __shared__ float Ps[NTHREADS_PER_BLOCK][NTHREADS_PER_BLOCK];

    int block_size = NTHREADS_PER_BLOCK;

    /* This thread's pointer into Ps. */
    float * threadPp = &Ps[ty][tx];
    *threadPp = 0.0f;

    /* We work in parallel over block_size^2 numbers, so we have to loop over
     * this if the matrix has more than that many columns */
    int column_jump_sizex = block_size * block_size;
    int n_column_jumpsx = nx / column_jump_sizex;
    int jidx;                   /* jump block idx */
    if ( nx % column_jump_sizex )
        n_column_jumpsx++;
    if ( nx < column_jump_sizex )
        n_column_jumpsx = 1;

    // Get x into shared memory
    int j = 0;
    int xsidx = 0;
    for ( j = 0; j < n_column_jumpsx; j++ )
    {
        jidx = j * column_jump_sizex;
        xsidx = jidx + ty * block_size + tx;
        if ( xsidx < nx )
            xs[xsidx] = x[xsidx] + r0;
    }

    __syncthreads();            // wait for data

    int column_jump_sizeA = block_size * block_size;
    int n_column_jumpsA = ncA / column_jump_sizeA;
    if ( ncA % column_jump_sizeA )
        n_column_jumpsA++;
    if ( ncA < column_jump_sizeA )
        n_column_jumpsA = 1;

    int b;
    int cidx;                   /* column idx w.r.t. the matrix */
    int bidx;                   /* block starting idx w.r.t. the matrix */
    float1 A4;                  /* matrix values in texture */
    int1 k;                     /* indexes into x held in texture matrix I */

    for ( j = 0; j < n_column_jumpsA; j++ )
    {
        jidx = j * column_jump_sizeA;
        bidx = tx * block_size;
        for ( b = 0; b < block_size; b += 1 )
        {
            cidx = jidx + bidx + b;
            k = tex2D(texRefI, cidx, ridx);
            A4 = tex2D(texRefA, cidx, ridx);
            if ( cidx < ncA && ridx < nrA )
                *threadPp += A4.x * xs[k.x];
        }
    }

    /* Reduction of block to y. e.g. y[i] <- P[i][0]+..+P[i][end] */
    int i = 0;
    for ( i = block_size >> 1; i > 0; i = i >> 1 ) /* div 2, quicker */
    {
        if ( tx < i )
            *threadPp += *(threadPp + i);
        __syncthreads();
    }

    if ( tx == 0 && ridx < nrA )
        y[ridx] = Ps[ty][0];
}

Have you looked at this:

For the code go to:

and then click on the project “Concurrent Number” in the flash application. They implemented several SpMM (Sparse Matrix x Matrix) in CUDA.

Unfortunately the CUDPP sparse matrix routines are not very efficient for matrices this sparse or with this distribution. I’m getting times of minutes on a small matrix-vector multiply with a GTX 280, when the 3.2 GHz P4 takes a few seconds…

Yeah, your kernel is a bit too divergent for my matrix.

I’ve done a good bit more work on this problem and come up with this:

Due to the nature of the problem, I’ve got a collection of B-smooth integers with their exponents represented in GF(2). The naive layout would be as follows:

     |  8 | 35 | 54 | 99 |
  11 |  0 |  0 |  0 |  1 |
   7 |  0 |  1 |  0 |  0 |
   5 |  0 |  1 |  0 |  0 |
   3 |  0 |  0 |  1 |  0 |
   2 |  1 |  0 |  1 |  0 |

Further, there is a quadratic character base, necessary for the algorithm to work properly, which has a density of approximately 50%. My idea is as follows: for any row with a weight larger than ncols / 32, break it out into a dense matrix. The dense matrix uses uints, each representing 32 rows of a column:

[column #1][column #2]…[column #N]

with column #x = [row #0][row #1]…[row #31]
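A CPU sketch of multiplying against that packed layout, assuming 32 dense rows with bit r of each column word holding the row-r entry (the function name is mine):

```c
#include <stdint.h>

/* Dense GF(2) matvec over the packed layout: each column is one 32-bit
 * word, bit r of cols[c] is A[r][c]. b[r] accumulates the XOR of x[c]
 * over every column whose bit r is set. */
static void dense_matvec_gf2(const uint32_t *cols, int ncols,
                             const uint32_t *x, uint32_t b[32])
{
    for (int r = 0; r < 32; r++)
        b[r] = 0;
    for (int c = 0; c < ncols; c++) {
        uint32_t col = cols[c];
        uint32_t xc  = x[c];
        /* walk the set bits of the column word */
        for (int r = 0; col; r++, col >>= 1)
            if (col & 1u)
                b[r] ^= xc;
    }
}
```

On the GPU the inner bit loop is what turns into the shared-memory atomicXor scatter in the kernel below.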

For the sparse part of the matrix, store in column offset format, so for the matrix above (assuming no dense rows):

row #0 = [ 3 ]

row #1 = [ 1 ]

row #2 = [ 1 ]

row #3 = [ 2 ]

row #4 = [ 0 ][ 2 ]

Then the rows are grouped by weight and a row index is stored, so a small LUT can translate back to the original row position. This means shared memory can be used to accumulate output rows, the warp stays convergent and busy, and the only random loads are the x[column offset] reads. At this point I do not believe this problem can be tackled efficiently with 100% “proper” memory accesses, but keeping the output vector in shared memory until the end of a block, and keeping thread occupancy high, is a huge performance gain.
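A CPU reference for that weight-grouped sparse part might look like the following, assuming a CSR-like `starts` array over the weight-sorted rows and a `row_lut` mapping sorted position back to the output row (all names are mine, not from the post):

```c
#include <stdint.h>

/* Sparse GF(2) matvec, rows sorted by weight. offs[] is the concatenated
 * column offsets; starts[i]..starts[i+1] delimit sorted row i (nrows+1
 * entries); row_lut[i] is the original row of sorted row i. The only
 * random reads are the x[offs[k]] gathers, as described above. */
static void sparse_matvec_gf2(const uint32_t *offs, const uint32_t *starts,
                              const uint32_t *row_lut, int nrows,
                              const uint32_t *x, uint32_t *b)
{
    for (int i = 0; i < nrows; i++) {
        uint32_t acc = 0;
        for (uint32_t k = starts[i]; k < starts[i + 1]; k++)
            acc ^= x[offs[k]];          /* GF(2) dot product */
        b[row_lut[i]] ^= acc;           /* scatter via the LUT */
    }
}
```

Grouping rows of equal weight means every thread in a warp runs the same inner-loop trip count, which is what keeps the warp convergent on the GPU.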

I’ve got CPU versions of these algorithms that seem to fit the CUDA access patterns, etc. However, I’ve yet to implement the sparse portion on the GPU, and the dense portion is giving me odd results.

Here’s the dense kernel thus far (btw, this computes 32 matrix-vector products on 32 dense rows; unused columns are set to 0 so they don’t alter the results, and sizes are hardcoded until I get the performance right):

/* b = A*x */

__global__ void matvec_dense(uint *A, uint *x, uint *b)
{
    register uint i, j, rx, rA;
    register uint idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ uint sh_b[32];

    rx = x[idx];
    __syncthreads();    // improved performance
    rA = A[idx];
    __syncthreads();    // ditto

    if(threadIdx.x < 32)
        sh_b[threadIdx.x] = 0;
    __syncthreads();    // prevent warp-to-block race

    #pragma unroll
    for(i = 0; i < 16; i++)
    {
        j = (threadIdx.x + i) & 15; // make sure per-warp access is coalesced
        if(rA & (1 << j))
            atomicXor(&sh_b[j], rx);
        __syncthreads();
    }

    #pragma unroll
    for(i = 0; i < 16; i++)
    {
        j = ((threadIdx.x + i) & 15) + 16; // same as last loop
        if(rA & (1 << j))
            atomicXor(&sh_b[j], rx);
        __syncthreads();
    }

    // write results back to output vector
    if(threadIdx.x < 32)
        atomicXor(&b[threadIdx.x], sh_b[threadIdx.x]);
    return;
}
[pstach@beast matvec_dense]$ ./matvec_dense
colrnd: 11217920
CPU: 2.868069s
GPU: 0.070692s

That’s 11217920 columns (when rounded to 512-column blocks) using the algorithm from the last post on 32 dense rows. CPU time vs. GPU time looks like it’s heading in the right direction.