I’ve been toying with CUDA’fying a few different nullspace-finding algorithms (Coppersmith’s Block Wiedemann and Montgomery’s Block Lanczos) to add to my number field sieve implementation.

The expensive portion is multiplying A by x to yield b, and A transpose by x to yield bt. A is a sparse matrix over GF(2), x is a dense vector, and b and bt are the resultant vectors.
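Concretely, since every nonzero of A is 1, the GF(2) product reduces to XOR-accumulating words of x into b. A minimal CPU reference sketch (names are my own; x carries one 32-bit word per column, i.e. a block of 32 vectors as in the block algorithms, and the matrix is stored per column as a list of row indices):

```c
#include <stdint.h>
#include <stddef.h>

/* CPU reference sketch of b = A*x over GF(2).
   A is stored column-major: column j's row indices live in
   row_idx[col_start[j] .. col_start[j+1]-1].
   x holds one 32-bit word per column, b one word per row (zeroed by
   the caller).  Because every nonzero of A is 1, the multiply is
   pure XOR accumulation. */
void spmv_gf2(const uint32_t *row_idx, const uint32_t *col_start,
              const uint32_t *x, uint32_t *b, size_t n_cols)
{
    for (size_t j = 0; j < n_cols; j++)
        for (uint32_t k = col_start[j]; k < col_start[j + 1]; k++)
            b[row_idx[k]] ^= x[j];   /* addition in GF(2) is XOR */
}
```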

The primary matrix I’ve been testing on is a 37875 x 37986 matrix with a total weight of 1012491 and an average weight of 26.65 per column (very sparse). It seems I’ve no choice but to store it in row-offset format, i.e. a list of row indices per column.

The first method, which oddly produces the fastest results on my GTX 280 yet still runs at only ~75% of the speed of a single 3.2 GHz P4 core, is as follows:

Each thread determines its column index and does a coalesced load of 4 uints containing row offsets, with (uint)-1 marking a null entry. For each member that is not (uint)-1, it XORs (addition in GF(2)) the appropriate word of x into b at that row offset using atomicXor. Thus the loads are coalesced, but there are tons of bank conflicts and the stores are random.
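On the CPU, the per-thread work of this first method looks roughly like the following sketch (names are my own; on the device the four-uint read is the coalesced load and the inner XOR is the atomicXor):

```c
#include <stdint.h>
#include <stddef.h>

#define NULL_OFF ((uint32_t)-1)   /* sentinel padding a column's offsets */

/* One "thread": walk column j's row offsets, padded to a multiple of 4
   with NULL_OFF, and XOR-scatter x_j into b (atomicXor on the GPU,
   since several columns may hit the same row of b). */
void process_column(const uint32_t *offs, size_t n_padded,
                    uint32_t x_j, uint32_t *b)
{
    for (size_t k = 0; k < n_padded; k += 4) {   /* 4-uint load */
        for (int m = 0; m < 4; m++) {
            uint32_t row = offs[k + m];
            if (row != NULL_OFF)
                b[row] ^= x_j;                   /* GF(2) accumulate */
        }
    }
}
```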

The second method consists of storing the row offsets as follows:

[0_0][1_0][2_0][3_0][0_1][1_1]…

[4_0][5_0][6_0][7_0][4_1][5_1]…

…

[60_0][61_0][62_0][63_0][60_1][61_1]…

and each “block” of columns is padded to the length of the heaviest column. Loads do not coalesce, although they should (each load is a single uint and __syncthreads() is used). Memory is allocated using cudaMallocPitch.
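If I’m reading the layout right, it’s essentially ELLPACK-style padding with 4-way interleaving, so entry k of column c lands at a predictable word index. A sketch of that index math (names are my own; pitch_words is a hypothetical padded row length in words, at least 4 times the heaviest column, e.g. as returned by the pitched allocation):

```c
#include <stdint.h>
#include <stddef.h>

/* Word index of the k-th row offset of column `col` in the interleaved
   layout above: columns are grouped 4 at a time, and the k-th entries
   of a group sit in 4 consecutive words, so a thread's 4-word slot (or
   4 neighbouring threads' single-word reads) is contiguous. */
size_t entry_index(uint32_t col, uint32_t k, size_t pitch_words)
{
    size_t group = col / 4;   /* which padded row of the layout */
    size_t lane  = col % 4;   /* position within the 4-word slot */
    return group * pitch_words + (size_t)k * 4 + lane;
}
```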

Performance is roughly 8% of the single 3.2 GHz P4 core.

Anyone have any ideas?

PS - If anyone is interested I’ve implemented the first method in a standalone .cu file for testing.

PPS - SGEMM with matrices like these is horrifically slow. Already tried it; plus I’m trying to cram as much into the device as possible, and SGEMM eats M x N x 32 bits of storage.
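For scale, a back-of-envelope on the figures quoted above (hedged: assuming 32-bit row offsets for the sparse form and 4-byte single-precision floats for SGEMM’s dense form):

```c
#include <stdint.h>

/* sparse: one 32-bit row offset per nonzero */
uint64_t sparse_index_bytes(uint64_t weight)
{
    return weight * 4;
}

/* dense single precision, as SGEMM requires: one float per entry */
uint64_t dense_float_bytes(uint64_t rows, uint64_t cols)
{
    return rows * cols * 4;
}
```

With rows = 37875, cols = 37986, and weight = 1012491, that works out to about 3.9 MB sparse versus roughly 5.4 GB dense, which is why SGEMM is a non-starter for cramming everything on the device.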