Transposed-ELLPACK multiplication routine

sBc-Random · June 19, 2013, 8:27am

I’m looking at performing a linear solver of an equation that can be expressed as follows

(X’X + cI) Y = Z

Where I’m storing X as an ELLPACK matrix with the following properties:

Dim(X) ~= (1e6,1e6)
Each row of X has exactly the same number of nnz in any given execution (approx 50)
Each column of the index matrix D of X is independent (the columns pointed to by D(:,i) will never be pointed to in X(:,j), j=/=i)

As I will be using an iterative solver, I will need to perform both the following calculations

(1) A=X*B;

(2) V=trans(X) * U;

Now,
(1) is trivial to code, but I have no idea how to code (2).
It is similar to matlab’s accumarray() applied to each column of D and U. Short of using an atomic add with every write, does anyone have any clue how to perform that calculation, using minimal scratch space on the card?

tera · June 19, 2013, 1:02pm

What is your particular reason to avoid atomic add? For performance reasons or because you want bit-for-it identical results between runs?

Performance-wise atomic add is quite good nowadays. Probably only a factor of 2 slower as in the non-transposed version you can keep the sums in registers.

sBc-Random · June 19, 2013, 1:14pm

Performance wise it’d be very slow. I would’ve thought this is a more researched topic, as it is one of the building blocks of sparse linear regression

vvolkov · June 19, 2013, 10:08pm

I think tera is right, it would be interesting to use atomics in this case. They might do well if you are using a sufficiently recent GPU. Note that different threads will be atomically adding to different locations, so there should be only a little serialization effect.

I don’t know how much you are concerned about performance, but if you need it badly, I’d look into using register blocking instead of ELLPACK.

Finally, why not keep both X and trans(X) as they were two different matrices? If X is constant and you do many iterations, the cost of transposing X might be well amortized.

CudaaduC · June 20, 2013, 12:17am

It may be possible to work with the ELLPACK format in cuSPARSE;

[url]cuSPARSE :: CUDA Toolkit Documentation

While not supported directly I think it can be manipulated into their hybrid format.

I agree with tera and vvolkov in that current day Atomics are really fast, and should considered.

sBc-Random · June 20, 2013, 1:37am

Hey guys,
Thanks for the responses!
CUSPARSE does indeed support hybrid, which is an extension of ELLPACK, so that’s great!
What it doesn’t support is :

multiple right-hand-sides
a Left-hand-side dense matrix rather than RHS, which for me would be more efficient (dimension is a multiple of 32)
Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported within hyb gemv :(

The reason the atomic route to me may not work, is that some of the index columns vary very little (some columns may write to the same index every row), whereas some vary a lot (each row of some columns is a different index).

Here’s a typical ELLPACK index matrix that I may come across

1 10
1 8
1 4
3 7
1 6
1 5
1 11
2 9
3 12

What I see happening, then, is that the threads may be stalling on the less-variable indices (in this case the first column).

The reason I’m not using X and trans(X) is that memory is at a premium in this calculation. To put this in perspective, I have a very quick csr and csc implementation, which allows me to use the same csr storage of a matrix to run both op_non_transpose and op_transpose by switching the indices around, but I feel that there will be a lot to gain from an ELLPACK implementation, particularly when my particular problem is analogous to a dense matmul with reads and register operations, but also saving space on the matrix storage :)

CudaaduC · June 20, 2013, 2:06am

So cuSPARSE Scrscm will not work because the input is not a triangular linear system, and because it mixes the sparse format with the dense?

keep in mind I am fairly new to all this but could you take the Cholesky factor of ((X’X + cI)) then use the triangular solver(assuming ‘I’ is the identity matrix of size Xcols and ‘c’ is a constant like 1.0).

(X’X + cI) should be positive definite in the case where one normalizes the columns of X(if needed). I am guessing that the Xrows>=Xcols based on that order of operations.

Regarding #3 → yes, that is a pain and I have run into that issue myself. Fortunately my data sets are small enough for cuBLAS.

Also one should actually test an Atomic implementation to see how it works, rather than assume it will be slow.

sBc-Random · June 20, 2013, 2:46am

When matrices get to a certain size, factorizations become all but useless unfortunately. This particular problem is on the order of rows = 10^6, cols = 10^6. A factorization would take up terabytes at the very least. Probably petabytes.

Though I will try out an Atomic implementation. Let you know how it goes ;)

sBc-Random · June 21, 2013, 3:17pm

Well, I’ve spent the day to write a first pass algorithm. If anyone can shoot me any pointers that would be greatly appreciated!! :D

This kernel calculates V’ * S where the nnz of each row of S is 32, and number of rows of V is 32. Further, the transposes of the ELLPACK matrices are stored, rather than the non-transpose. See the whos output from matlab.

outz=ellpacktest(index,val,D);
whos
ellpacktest.cpp(63):0.065032
Name Size Bytes Class Attributes

D 32x1000000 256000000 double
index 32x1000000 256000000 double
outz 32x19434 4975104 double
val 32x1000000 256000000 double

#DEFINE BLOCKSIZE 128
#DEFINE BLOCKS_PER_SM 16
template<typename T>
__launch_bounds__(BLOCKSIZE, BLOCKS_PER_SM)
__global__ void ELLgemmT(T* result, unsigned int * index, T *vals, T *dense,
						unsigned int length)
{
/*
Transposed ELLgemm calculation - assumes 32 nonzero per row, and 32 dense columns
Needs generalization.
Stores everything as (transposed) short-fat matrices i.e 32 * BIGDIM
Currently block structure is 32 * 4 - each warp reads in a row (stored on-card as a column)
of each matrix. Each thread then computes a [32][1] strip of the resulting [32][32]
Calculation - obviously an inefficient way to calculate this submatrix
Hopefully you guys can help me out with this bit.
Then, once the 32*32 tile of data is calculated:
Checks whether the next entry down will be writing to the same spot
If so:
Holds the sum in the register. 
If not: 
Adds the register value to global memory
This is useful for low variance indices, particularly dense columns of S.
 
Finally, when it exits the loop, it writes whatever's still left in registry to global mem
*/

	int tidx = threadIdx.x;
	int tidy = threadIdx.y;
	int tid = tidx + 32 * tidy;
	
	//advance pointers
	index 	+= tid + blockIdx.x * BLOCKSIZE;
	vals 	+= tid + blockIdx.x * BLOCKSIZE; 
	dense 	+= tid + blockIdx.x * BLOCKSIZE;
	volatile __shared__ uint   	index_sh[BLOCKSIZE];
	volatile __shared__ uint   	index_sh_buf[BLOCKSIZE];
	volatile __shared__ T 		vals_sh[BLOCKSIZE];

	//prefetch
	index_sh[tid]=index[0];
	vals_sh[tid]=vals[0];

	T dense_reg=dense[0];
	T calc[32]={0.0f};

	T vals_sh_buf=0.0f;
	T dense_reg_buf=0.0f;
	
	for (int read=tidy+blockIdx.x*BLOCKSIZE/32; read<length; read+=BLOCKSIZE*BLOCKS_PER_SM)
	{
		//fetch next iteration's shared mem
		if (read+BLOCKSIZE*BLOCKS_PER_SM<length)
		{
			index_sh_buf[tid]=index[32*BLOCKSIZE*BLOCKS_PER_SM];
			vals_sh_buf=vals[32*BLOCKSIZE*BLOCKS_PER_SM];
			dense_reg_buf=dense[32*BLOCKSIZE*BLOCKS_PER_SM];
		}
		
		//calculate each 32*1 strip
		#pragma unroll
		for (int i=0; i<32; i++)
		{
			T vals_reg=vals_sh[i+32*tidy];
			calc[i]+=vals_reg*dense_reg;
		}

		//Check which writes will be different next iter;
		//Where found, write to the matrix and reset register
		//else donothing
		//#pragma unroll
		for (int i=0; i<32; i++)
		{
			if (index_sh[i+32*tidy]==index_sh_buf[i+32*tidy])
			{
			}
			else
			{
				//result[tidx]+=calc[i]*calc[i];
				atomicAdd(&result[tidx+32*index_sh[i+32*tidy]],calc[i]);
				calc[i]=0;
			}
		}
		//update smem
		index_sh[tid]= index_sh_buf[tid];
		vals_sh[tid] = vals_sh_buf;
		dense_reg = dense_reg_buf;

		index+=32*BLOCKSIZE*BLOCKS_PER_SM;
		vals+=32*BLOCKSIZE*BLOCKS_PER_SM;
		dense+=32*BLOCKSIZE*BLOCKS_PER_SM;
	}
	for (int i=0; i<32; i++)
	{
		atomicAdd(&result[tidx+32*index_sh[i+32*tidy]],calc[i]);
	}
}

void ELLgemm(float *result, unsigned int *index, float *vals, float *dense, unsigned int length, bool trans)
{
	dim3 threadsPerBlock(32,BLOCKSIZE/32);
	int numblocks = BLOCKS_PER_SM * 32;
	if (trans)
	{
		ELLgemmT<float><<<numblocks, threadsPerBlock>>>(result, index, vals, dense, length);
	}
}

ellpack.cu (2.14 KB)

sBc-Random · June 24, 2013, 3:01pm

After taking on board a few pointers from vvolkov (thanks!), starting from scratch and adding complexity where necessary, I’ve so far reduced execution time by about 30%.

I’m struggling to narrow down what the bottleneck is after this point. If I comment out the atomic write, by replacing it with a coalesced dense write, it’s only about 10% faster, which means I’m missing something crucial regarding the memory transactions prior. Will I be facing a large performance hit by the big read-stride from each warp? (Each warp will read in column (a), then column (a+4096), then column (a+8192) )

template<typename T>
__launch_bounds__(256, 8)
__global__ void ELLgemmT(T* result, const unsigned int* __restrict__ index, 
		const T* __restrict__ vals, const T* __restrict__ dense, unsigned int length)
{
int lane = threadIdx.x;
int warp = threadIdx.y;

int off1 = blockIdx.x * 256 + warp * 32;
int off2 = gridDim.x * 256;
//advance pointers
index 	+= off1;
vals 	+= off1; 
dense 	+= off1;
int read = off1/32;
	
T calc[32]={0.0f};

int	index_reg		=	index[lane];
T	vals_reg		=	vals[lane];
T	dense_reg		=	dense[lane];

int index_reg_buf = 0;
T vals_reg_buf = 0.0f;
T dense_reg_buf = 0.0f;
while (read<length)
{
	if (read+1<length)
	{
		index_reg_buf		=	index[lane + off2];
		vals_reg_buf		=	vals[lane + off2];
		dense_reg_buf		=	dense[lane + off2];
	}
	else
	{
		index_reg_buf = -1;
	}
	for (int i=0; i<32; i++)
	{
		T use_vals		= __shfl(vals_reg,i);
		int use_index		= __shfl(index_reg,i);
		int use_index_buf	= __shfl(index_reg_buf,i);		
		calc[i]+=use_vals * dense_reg;
		if (use_index!=use_index_buf)
		{
			result[lane+off1]=calc[i];  //this isn't any faster
			//atomicAdd(result + lane + 32 * use_index,calc[i]);
			calc[i]=0;
		}
	}
	index_reg = index_reg_buf;
	vals_reg = vals_reg_buf;
	dense_reg = dense_reg_buf;
	index += off2;
	dense += off2;
	vals += off2;
	read+= off2/32;
}

}

void ELLgemm(float *result, unsigned int *index, float *vals, float *dense, unsigned int length, bool trans)
{
	dim3 threadsPerBlock(32, 8);
	int numblocks = BLOCKS_PER_SM * 32;
	if (trans)
	{
		ELLgemmT<float><<<numblocks, threadsPerBlock>>>(result, index, vals, dense, length);
	}
	else
	{
		ELLgemmT<float><<<numblocks, threadsPerBlock>>>(result, index, vals, dense, length);
	}
}

sBc-Random · June 29, 2013, 2:22pm

Any ideas, guys? I’m happy to pay for your services, and I need a solution within the next week :)