computing determinant of structured matrices how to build a scalable solution ?


I’ve been working on the algorithm to compute the determinant of a non-symmetric structured matrix

using the displacement structure approach.

Without getting deep into details, the algorithm is similar to that of standard

Gaussian elimination but performed on matrix generators instead of the matrix itself.

It performs LDU-factorization of a matrix, then the product of diagonal entries yields the determinant.

A simplified algorithm is given below (here an n x n matrix

is represented by 4 generator columns ‘a’, ‘b’, ‘c’ and ‘d’):

for(i = 0; i < n; i++) {

	// get the current top elements of generator columns

	a0 = a[i], b0 = b[i], c0 = c[i], d0 = d[i];

	for(j = i + 1; j < n; j++) {

		// Gaussian elimination via Givens rotations

		aj = a0*a[j] - c0*b[j];

		bj = b0*..

		cj = c0*..

		dj = d0*..

		a[j] = aj, b[j] = bj, c[j] = cj, d[j] = dj; // update vectors with new values


	det = det * a0 * c0; // collect the factors of determinant

	// shift down vectors a and c

	a[k+1] = a[k] forall k = i..n

	c[k+1] = c[k] forall k = i..n


return det;

so, basically the algorithm performs ‘n’ iterations collecting one factor of the determinant at a time.

In each iteration the length of generators (vectors a, b, c and d) decreases by 1.

In my current implementation the inner loop is completely vectorized because

element updates of vectors a, b, c and d can proceed independently.

In other words, I start with ‘n’ working threads where one thread updates one entry of each vector.

In each iteration of the outer loop

the # of working threads decreases by 1. Shifting down the vectors ‘a’ and ‘c’ is done in shared memory.

However this solution is not scalable: I can only handle matrices up to 2048 size, ie., max 512 threads per block, each thread

operates on 4 elements at a time.

So, I need some way to distribute the problem over different blocks which is not easy because

in each iteration of the outer loop the top elements a0, b0, c0 and d0 of generators

must be shared between all threads.

I wonder if anyone have ideas how to deal with this problem ?