Hey Everyone,

I’ve been working on parallelizing an algorithm for quite some time now. I’m finally done, but my results aren’t impressive - I’m only seeing a 2-3x speedup. I wonder if any of you might have recommendations.

The algorithm multiplies a column of an NxN matrix by a vector of size N. Each column depends on the previous column, so col 3 must wait for col 2 to finish before col 3 can even start. As such, I’ve had to resort to calling my kernel in a for loop from host code. On a whim, I moved the for loop into the kernel just to see how much faster it would be, and the results were staggering (to me anyway) - a 13-16x speedup. That version produces incorrect results, though, since CUDA executes blocks in any order it sees fit, so there’s no guarantee one column finishes before the next one starts.

The second problem is that I can’t really use shared memory: loading it, running the computations, and writing back to global memory takes longer than just working in global memory to begin with. (This is because the data has to be re-loaded on every iteration of the for loop in host code.)

Anyway, here is the code for the kernel. (cuDoubleComplex here is a structure similar to the cuComplex structure, but with added position x (px) and position y (py) variables; the `*` in the kernel is an overloaded complex multiply.)

```
__global__ void L(cuDoubleComplex *a, cuDoubleComplex *b, cuDoubleComplex *c, int num, int N){
    // num refers to the particular column of the matrix
    // N = vector size
    int tid = blockIdx.x*blockDim.x + threadIdx.x; // map threads to matrix/array elements
    if(tid >= N*N) return; // guard against reading past the end of the N*N matrix

    if(a[tid].px==num && a[tid].py==num){ // handle the element on the diagonal
        if(num==0){ // hardcoded special case where num==0
            c[num] = a[num] * b[num]; // number on the diagonal for num=0
        }else{
            c[num] = a[tid] * b[num]; // number on the diagonal
        }
    }
    if(a[tid].px==num && a[tid].py > num && a[tid].py < N){ // numbers below the diagonal
        b[a[tid].py] = b[a[tid].py] + a[tid] * b[num];
    }
}
```
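For completeness, this is roughly what the element type and its multiply look like (a sketch of my setup, not verbatim: the field names match the kernel, and the operator is a standard complex multiply - in the real CUDA code it would also need a `__device__` qualifier):

```cpp
#include <cassert>

// Sketch of the element type used by the kernel: a double-precision complex
// value that also records its matrix coordinates (px = column, py = row).
struct cuDoubleComplex {
    double x, y;   // real and imaginary parts
    int    px, py; // position in the matrix
};

// Standard complex multiply: (a+bi)(c+di) = (ac - bd) + (ad + bc)i.
// The position fields of the result are left zeroed; the kernel never reads them.
cuDoubleComplex operator*(const cuDoubleComplex& l, const cuDoubleComplex& r) {
    cuDoubleComplex out = {};
    out.x = l.x * r.x - l.y * r.y;
    out.y = l.x * r.y + l.y * r.x;
    return out;
}
```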

And I’m calling it like so:

```
for(ctr = 0; ctr < N; ctr++){ // call the L kernel once per column
    L<<<numBlocks, nTL>>>(d_a, d_b, d_c, ctr, N);
}
```