Speed up a data-dependent algorithm

Hey Everyone,

I’ve been working to parallelize an algorithm for quite some time now. I am finally done, but my results aren’t impressive - I’m only seeing a 2-3x speed up. I wonder if any of you might have any recommendations.

The algorithm multiplies a column of an NxN matrix by a vector of size N. Each column is dependent on the previous column, so col 2 must wait for col 3 to finish before col 2 can even start. As such, I’ve had to resort to calling my kernel in a for loop from host code. On a whim, I changed the kernel to incorporate a for loop just to see how much faster it would be, and the results were staggering (to me anyway) - 13-16x speedup. This produces incorrect results though, as cuda will execute blocks in any order it sees fit.

The second problem is that I can’t really use shared memory because it takes longer to load it, run the computations then write back to global memory, than just using global memory to begin with. (This is because it must be loaded for each iteration of the for loop in host code).

Anyway, here is the code for the kernel (cuDoubleComplex is a structure similar to the cuComplex structure, but with added position x(px) and position y(py) variables):

__global__ void L(cuDoubleComplex *a, cuDoubleComplex* b, cuDoubleComplex *c, int num, int N){

	// num refers to the particular column of the matrix

	// N = vector size

	int tid = blockIdx.x*blockDim.x+threadIdx.x;				// map threads to matrix/array elements

	if(a[tid].px==num && a[tid].py==num){					// handle elements on the diagonal

		if(num==0){							// hardcode special case where num==0

			c[num] = a[num] * b[num];				// number on the diagonal for num=0


			c[num] = a[tid] * b[num];				// number on the diagonal



	if(a[tid].px==num && a[tid].py > num && a[tid].py < N){		// numbers below the diagonal

		b[a[tid].py] = b[a[tid].py] + a[tid] * b[num];



And I’m calling it like so:

for(ctr=0;ctr<N;ctr++)			// call L kernel sequentially

	L<<<numBlocks,nTL>>>(d_a, d_b, d_c, ctr, N);

Isn’t ‘a’ supposed to be a matrix? What’s the typical size of N? what’s the numBlocks and nTL?

I think you’re probably doing things in a very very wrong manner. Also, I don’t understand why column 1 needs to be dependent on column 0.
If your columns have forward dependency, your rows are independent of each other right?

Maybe you can give a clearer explanation on what it is you are trying to do. A simple matrix multiplication certainly doesn’t have the kind of dependency you mentioned.

A typical size of N is between 6000 and 10000. The matrix is sparse, but we then factorize it to make it a couple times more dense. The calculation isn’t your typical matrix vector multiplication. The matrix is broken up into L and U along the main diagonal. L is column dependent, and then U is row dependent. It is an algorithm used in a power grid simulator to calculate things like line voltage adjustments. I’ve asked the engineers if there is any way around the dependencies and they have said no.

I should go further to explain that the matrix a, is extremely sparse and is in a compacted format. As such, ‘a’ is really just a large vector with x,y coordinates (px, py) and a complex number (vx, vy).

What I’ve done is calculate the number of elements, and then use that in correlation with the number of threads (nTL) to figure out how many blocks (numBlocks) will be needed to have one thread per matrix element. I’ve also found that by using a lower number of threads, I get a slight speed up. e.g. 512 threads vs 1024 (on my 570 fermi based card).

You could be right - I might be doing this in a very very wrong manner, which is why I was hoping for some advice. There are a lot of restrictions plaguing this algorithm, which make it hard to make a good design decision.

Thanks for your response.

Using 512 threads per block gives more registers per thread. I’m surprised the kernel could use more than 32 registers. It could be some problem with the multiplication part.

If I did not understand your code wrongly, your algorithm has these characteristics:
Each element is gone through N times.
c is the only output, c[i] is only dependent on the ith diagonal value and b[i].
b[i] is dependent on all elements in the ith row as well as all the b values before the current b.

If I’m not wrong, then there is very little parallelism that could be used, especially when the matrix has high sparsity. Does your CPU code go through each element N times as well? If that’s the case you’d be better off optimising your CPU code. If you think your CPU code is highly efficient, then I guess the sparsity if the matrix is not as high as I imagined.

EDIT: Just looked at the algorithm again. Hope I’m not wrong:
c[i] = a[i, i] * b[i]
b[i] = sum(r = 0, to i-1) a[r, i]* b[r];
parallelism could be found only when i gets large. That is when the summation has slightly larger number of elements to work on.
Advise from the above analysis: group your elements by rows. Do not use GPU when for the first few hundreds of rows. Go through your elements in rows, not in columns. Each time only go through all elements in the same row. Go through each element only once. At each go, calculate b at the end. Calculate c after all the rows have been gone through.

On another thought, I realized the above advice may not be so good… maybe by expanding b[r] you could get more paralellism… I guess it’s still dependent on the sparsity of your matrix and you’ll have to try various methods to see which one gives better result.