Speed up a data-dependent algorithm

fender177 · May 18, 2011, 1:53am

Hey Everyone,

I’ve been working to parallelize an algorithm for quite some time now. I am finally done, but my results aren’t impressive - I’m only seeing a 2-3x speed up. I wonder if any of you might have any recommendations.

The algorithm multiplies a column of an NxN matrix by a vector of size N. Each column is dependent on the previous column, so col 2 must wait for col 3 to finish before col 2 can even start. As such, I’ve had to resort to calling my kernel in a for loop from host code. On a whim, I changed the kernel to incorporate a for loop just to see how much faster it would be, and the results were staggering (to me anyway) - 13-16x speedup. This produces incorrect results though, as cuda will execute blocks in any order it sees fit.

The second problem is that I can’t really use shared memory because it takes longer to load it, run the computations then write back to global memory, than just using global memory to begin with. (This is because it must be loaded for each iteration of the for loop in host code).

Anyway, here is the code for the kernel (cuDoubleComplex is a structure similar to the cuComplex structure, but with added position x(px) and position y(py) variables):

__global__ void L(cuDoubleComplex *a, cuDoubleComplex* b, cuDoubleComplex *c, int num, int N){

	// num refers to the particular column of the matrix

	// N = vector size

	int tid = blockIdx.x*blockDim.x+threadIdx.x;				// map threads to matrix/array elements

	if(a[tid].px==num && a[tid].py==num){					// handle elements on the diagonal

		if(num==0){							// hardcode special case where num==0

			c[num] = a[num] * b[num];				// number on the diagonal for num=0

		}else{

			c[num] = a[tid] * b[num];				// number on the diagonal

		}									

	}

	if(a[tid].px==num && a[tid].py > num && a[tid].py < N){		// numbers below the diagonal

		b[a[tid].py] = b[a[tid].py] + a[tid] * b[num];

	}

}

And I’m calling it like so:

for(ctr=0;ctr<N;ctr++)			// call L kernel sequentially

	L<<<numBlocks,nTL>>>(d_a, d_b, d_c, ctr, N);

hyqneuron · May 18, 2011, 8:13am

Isn’t ‘a’ supposed to be a matrix? What’s the typical size of N? what’s the numBlocks and nTL?

I think you’re probably doing things in a very very wrong manner. Also, I don’t understand why column 1 needs to be dependent on column 0.
If your columns have forward dependency, your rows are independent of each other right?

Maybe you can give a clearer explanation on what it is you are trying to do. A simple matrix multiplication certainly doesn’t have the kind of dependency you mentioned.

fender177 · May 18, 2011, 11:47am

A typical size of N is between 6000 and 10000. The matrix is sparse, but we then factorize it to make it a couple times more dense. The calculation isn’t your typical matrix vector multiplication. The matrix is broken up into L and U along the main diagonal. L is column dependent, and then U is row dependent. It is an algorithm used in a power grid simulator to calculate things like line voltage adjustments. I’ve asked the engineers if there is any way around the dependencies and they have said no.

I should go further to explain that the matrix a, is extremely sparse and is in a compacted format. As such, ‘a’ is really just a large vector with x,y coordinates (px, py) and a complex number (vx, vy).

What I’ve done is calculate the number of elements, and then use that in correlation with the number of threads (nTL) to figure out how many blocks (numBlocks) will be needed to have one thread per matrix element. I’ve also found that by using a lower number of threads, I get a slight speed up. e.g. 512 threads vs 1024 (on my 570 fermi based card).

You could be right - I might be doing this in a very very wrong manner, which is why I was hoping for some advice. There are a lot of restrictions plaguing this algorithm, which make it hard to make a good design decision.

Thanks for your response.

hyqneuron · May 18, 2011, 3:06pm

Using 512 threads per block gives more registers per thread. I’m surprised the kernel could use more than 32 registers. It could be some problem with the multiplication part.

If I did not understand your code wrongly, your algorithm has these characteristics:
Each element is gone through N times.
c is the only output, c[i] is only dependent on the ith diagonal value and b[i].
b[i] is dependent on all elements in the ith row as well as all the b values before the current b.

If I’m not wrong, then there is very little parallelism that could be used, especially when the matrix has high sparsity. Does your CPU code go through each element N times as well? If that’s the case you’d be better off optimising your CPU code. If you think your CPU code is highly efficient, then I guess the sparsity if the matrix is not as high as I imagined.

EDIT: Just looked at the algorithm again. Hope I’m not wrong:
c[i] = a[i, i] * b[i]
b[i] = sum(r = 0, to i-1) a[r, i]* b[r];
parallelism could be found only when i gets large. That is when the summation has slightly larger number of elements to work on.
Advise from the above analysis: group your elements by rows. Do not use GPU when for the first few hundreds of rows. Go through your elements in rows, not in columns. Each time only go through all elements in the same row. Go through each element only once. At each go, calculate b at the end. Calculate c after all the rows have been gone through.

On another thought, I realized the above advice may not be so good… maybe by expanding b[r] you could get more paralellism… I guess it’s still dependent on the sparsity of your matrix and you’ll have to try various methods to see which one gives better result.

Topic		Replies	Views
slow kernel CUDA Programming and Performance	4	1451	June 25, 2009
Paralel Reduction With less than 8000 values CUDA Programming and Performance	27	7753	July 22, 2010
Need Help Optimizing Kernel CUDA Programming and Performance	6	1067	January 7, 2014
Problem with matrix size CUDA Programming and Performance	10	3298	March 27, 2008
Subtle problem with matrix-matrix multiplication CUDA Programming and Performance	10	1213	December 12, 2020
Why different shape matrix multiplication have different performance? CUDA Programming and Performance	2	772	August 26, 2018
Using atomicAdd to step through an array CUDA Programming and Performance	7	3915	May 24, 2011
Cuda Latency problems Slow Cuda CUDA Programming and Performance	15	13942	September 5, 2008
Multiplying a system of 3x3 matrices efficiently CUDA Programming and Performance	2	8849	September 11, 2009
handling blocks and threads..... urgent..! CUDA Programming and Performance	10	1728	June 2, 2009

Speed up a data-dependent algorithm

Related topics