 # Transposed-ELLPACK multiplication routine

I’m looking at performing a linear solver of an equation that can be expressed as follows

(X’X + cI) Y = Z

Where I’m storing X as an ELLPACK matrix with the following properties:

1. Dim(X) ~= (1e6,1e6)

2. Each row of X has exactly the same number of nnz in any given execution (approx 50)

3. Each column of the index matrix D of X is independent (the columns pointed to by D(:,i) will never be pointed to in X(:,j), j=/=i)

As I will be using an iterative solver, I will need to perform both the following calculations

(1) A=X*B;

(2) V=trans(X) * U;

Now,
(1) is trivial to code, but I have no idea how to code (2).
It is similar to matlab’s accumarray() applied to each column of D and U. Short of using an atomic add with every write, does anyone have any clue how to perform that calculation, using minimal scratch space on the card?

What is your particular reason to avoid atomic add? For performance reasons or because you want bit-for-it identical results between runs?

Performance-wise atomic add is quite good nowadays. Probably only a factor of 2 slower as in the non-transposed version you can keep the sums in registers.

Performance wise it’d be very slow. I would’ve thought this is a more researched topic, as it is one of the building blocks of sparse linear regression

I think tera is right, it would be interesting to use atomics in this case. They might do well if you are using a sufficiently recent GPU. Note that different threads will be atomically adding to different locations, so there should be only a little serialization effect.

I don’t know how much you are concerned about performance, but if you need it badly, I’d look into using register blocking instead of ELLPACK.

Finally, why not keep both X and trans(X) as they were two different matrices? If X is constant and you do many iterations, the cost of transposing X might be well amortized.

It may be possible to work with the ELLPACK format in cuSPARSE;

http://docs.nvidia.com/cuda/cusparse/index.html#topic_3_3_4

While not supported directly I think it can be manipulated into their hybrid format.

I agree with tera and vvolkov in that current day Atomics are really fast, and should considered.

Hey guys,
Thanks for the responses!
CUSPARSE does indeed support hybrid, which is an extension of ELLPACK, so that’s great!
What it doesn’t support is :

1. multiple right-hand-sides
2. a Left-hand-side dense matrix rather than RHS, which for me would be more efficient (dimension is a multiple of 32)
3. Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported within hyb gemv :(

The reason the atomic route to me may not work, is that some of the index columns vary very little (some columns may write to the same index every row), whereas some vary a lot (each row of some columns is a different index).

Here’s a typical ELLPACK index matrix that I may come across

1 10
1 8
1 4
3 7
1 6
1 5
1 11
2 9
3 12

What I see happening, then, is that the threads may be stalling on the less-variable indices (in this case the first column).

The reason I’m not using X and trans(X) is that memory is at a premium in this calculation. To put this in perspective, I have a very quick csr and csc implementation, which allows me to use the same csr storage of a matrix to run both op_non_transpose and op_transpose by switching the indices around, but I feel that there will be a lot to gain from an ELLPACK implementation, particularly when my particular problem is analogous to a dense matmul with reads and register operations, but also saving space on the matrix storage :)

So cuSPARSE Scrscm will not work because the input is not a triangular linear system, and because it mixes the sparse format with the dense?

keep in mind I am fairly new to all this but could you take the Cholesky factor of ((X’X + cI)) then use the triangular solver(assuming ‘I’ is the identity matrix of size Xcols and ‘c’ is a constant like 1.0).

(X’X + cI) should be positive definite in the case where one normalizes the columns of X(if needed). I am guessing that the Xrows>=Xcols based on that order of operations.

Regarding #3 -> yes, that is a pain and I have run into that issue myself. Fortunately my data sets are small enough for cuBLAS.

Also one should actually test an Atomic implementation to see how it works, rather than assume it will be slow.

When matrices get to a certain size, factorizations become all but useless unfortunately. This particular problem is on the order of rows = 10^6, cols = 10^6. A factorization would take up terabytes at the very least. Probably petabytes.

Though I will try out an Atomic implementation. Let you know how it goes ;)

Well, I’ve spent the day to write a first pass algorithm. If anyone can shoot me any pointers that would be greatly appreciated!! :D

This kernel calculates V’ * S where the nnz of each row of S is 32, and number of rows of V is 32. Further, the transposes of the ELLPACK matrices are stored, rather than the non-transpose. See the whos output from matlab.

outz=ellpacktest(index,val,D);
whos
ellpacktest.cpp(63):0.065032
Name Size Bytes Class Attributes

D 32x1000000 256000000 double
index 32x1000000 256000000 double
outz 32x19434 4975104 double
val 32x1000000 256000000 double

``````#DEFINE BLOCKSIZE 128
#DEFINE BLOCKS_PER_SM 16
template<typename T>
__launch_bounds__(BLOCKSIZE, BLOCKS_PER_SM)
__global__ void ELLgemmT(T* result, unsigned int * index, T *vals, T *dense,
unsigned int length)
{
/*
Transposed ELLgemm calculation - assumes 32 nonzero per row, and 32 dense columns
Needs generalization.
Stores everything as (transposed) short-fat matrices i.e 32 * BIGDIM
Currently block structure is 32 * 4 - each warp reads in a row (stored on-card as a column)
of each matrix. Each thread then computes a  strip of the resulting 
Calculation - obviously an inefficient way to calculate this submatrix
Hopefully you guys can help me out with this bit.
Then, once the 32*32 tile of data is calculated:
Checks whether the next entry down will be writing to the same spot
If so:
Holds the sum in the register.
If not:
Adds the register value to global memory
This is useful for low variance indices, particularly dense columns of S.

Finally, when it exits the loop, it writes whatever's still left in registry to global mem
*/

int tid = tidx + 32 * tidy;

index 	+= tid + blockIdx.x * BLOCKSIZE;
vals 	+= tid + blockIdx.x * BLOCKSIZE;
dense 	+= tid + blockIdx.x * BLOCKSIZE;
volatile __shared__ uint   	index_sh[BLOCKSIZE];
volatile __shared__ uint   	index_sh_buf[BLOCKSIZE];
volatile __shared__ T 		vals_sh[BLOCKSIZE];

//prefetch
index_sh[tid]=index;
vals_sh[tid]=vals;

T dense_reg=dense;
T calc={0.0f};

T vals_sh_buf=0.0f;
T dense_reg_buf=0.0f;

{
//fetch next iteration's shared mem
{
index_sh_buf[tid]=index[32*BLOCKSIZE*BLOCKS_PER_SM];
vals_sh_buf=vals[32*BLOCKSIZE*BLOCKS_PER_SM];
dense_reg_buf=dense[32*BLOCKSIZE*BLOCKS_PER_SM];
}

//calculate each 32*1 strip
#pragma unroll
for (int i=0; i<32; i++)
{
T vals_reg=vals_sh[i+32*tidy];
calc[i]+=vals_reg*dense_reg;
}

//Check which writes will be different next iter;
//Where found, write to the matrix and reset register
//else donothing
//#pragma unroll
for (int i=0; i<32; i++)
{
if (index_sh[i+32*tidy]==index_sh_buf[i+32*tidy])
{
}
else
{
//result[tidx]+=calc[i]*calc[i];
calc[i]=0;
}
}
//update smem
index_sh[tid]= index_sh_buf[tid];
vals_sh[tid] = vals_sh_buf;
dense_reg = dense_reg_buf;

index+=32*BLOCKSIZE*BLOCKS_PER_SM;
vals+=32*BLOCKSIZE*BLOCKS_PER_SM;
dense+=32*BLOCKSIZE*BLOCKS_PER_SM;
}
for (int i=0; i<32; i++)
{
}
}

void ELLgemm(float *result, unsigned int *index, float *vals, float *dense, unsigned int length, bool trans)
{
int numblocks = BLOCKS_PER_SM * 32;
if (trans)
{
ELLgemmT<float><<<numblocks, threadsPerBlock>>>(result, index, vals, dense, length);
}
}
``````

ellpack.cu (2.14 KB)

After taking on board a few pointers from vvolkov (thanks!), starting from scratch and adding complexity where necessary, I’ve so far reduced execution time by about 30%.

I’m struggling to narrow down what the bottleneck is after this point. If I comment out the atomic write, by replacing it with a coalesced dense write, it’s only about 10% faster, which means I’m missing something crucial regarding the memory transactions prior. Will I be facing a large performance hit by the big read-stride from each warp? (Each warp will read in column (a), then column (a+4096), then column (a+8192) )

``````template<typename T>
__launch_bounds__(256, 8)
__global__ void ELLgemmT(T* result, const unsigned int* __restrict__ index,
const T* __restrict__ vals, const T* __restrict__ dense, unsigned int length)
{

int off1 = blockIdx.x * 256 + warp * 32;
int off2 = gridDim.x * 256;
index 	+= off1;
vals 	+= off1;
dense 	+= off1;

T calc={0.0f};

int	index_reg		=	index[lane];
T	vals_reg		=	vals[lane];
T	dense_reg		=	dense[lane];

int index_reg_buf = 0;
T vals_reg_buf = 0.0f;
T dense_reg_buf = 0.0f;
{
{
index_reg_buf		=	index[lane + off2];
vals_reg_buf		=	vals[lane + off2];
dense_reg_buf		=	dense[lane + off2];
}
else
{
index_reg_buf = -1;
}
for (int i=0; i<32; i++)
{
T use_vals		= __shfl(vals_reg,i);
int use_index		= __shfl(index_reg,i);
int use_index_buf	= __shfl(index_reg_buf,i);
calc[i]+=use_vals * dense_reg;
if (use_index!=use_index_buf)
{
result[lane+off1]=calc[i];  //this isn't any faster
//atomicAdd(result + lane + 32 * use_index,calc[i]);
calc[i]=0;
}
}
index_reg = index_reg_buf;
vals_reg = vals_reg_buf;
dense_reg = dense_reg_buf;
index += off2;
dense += off2;
vals += off2;
}

}

void ELLgemm(float *result, unsigned int *index, float *vals, float *dense, unsigned int length, bool trans)
{