I originally posted this in the General Forum, but I think that may have been the wrong place. Perhaps it is better suited here.

Hi Guys,

I have been working on a matrix-vector multiplication routine. Because of the structure of the matrix (all entries in a given column are the same, apart from the diagonal), I can split it into its diagonal and a single array of off-diagonal values. These are stored as 1D arrays in constant memory. I then multiply them by a vector held in shared memory to get the full matrix-vector product. I initially wrote the following code to do this.
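To make the structure concrete, here is a minimal host-side sketch of the product I am trying to compute. The function name `matVecReference` and the use of `std::vector` are only for illustration and are not part of my device code. It assumes row `r` has `L_d[r]` on the diagonal and `L[c]` in every column `c < r`.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference for the compact product (hypothetical helper,
// not part of the device code). Assumes row r of the matrix has
// L_d[r] on the diagonal and L[c] in every column c < r.
std::vector<float> matVecReference(const std::vector<float>& L_d,
                                   const std::vector<float>& L,
                                   const std::vector<float>& z) {
    const std::size_t n = z.size();
    std::vector<float> x(n);
    for (std::size_t r = 0; r < n; ++r) {
        float sum = L_d[r] * z[r];           // diagonal term
        for (std::size_t c = 0; c < r; ++c)  // columns left of the diagonal
            sum += L[c] * z[c];
        x[r] = sum;
    }
    return x;
}
```

Each row `r` here corresponds to what one thread with `tid == r` is meant to compute on the device.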

[codebox]static __device__ void MatrixVecMult (float *z, float *x) {

    int tid = threadIdx.x;
    int i;

    x[tid] = L_d[tid] * z[tid];

    for (i = 0; i < tid; i++)
        x[tid] += L[i] * z[i];

}[/codebox]

where z and x are declared in shared memory and have blockDim.x elements, and L and L_d hold the off-diagonal values and the diagonal respectively and are sized the same way. This code does not work with more than 32 threads per block. That suggests to me that it has something to do with warps, but I am not sure. The following code performs exactly the same operations, yet it works correctly and is stable.

[codebox]static __device__ void MatrixVecMult (float *z, float *x) {

    int tid = threadIdx.x;
    int i;

    x[tid] = L_d[tid] * z[tid];

    for (i = 0; i < blockDim.x; i++) {
        if (i == tid)
            break;
        x[tid] += L[i] * z[i];
    }

}[/codebox]

As you can see, the algorithms are the same. Is there a technical issue I am missing here? Something to do with warps and memory access? Is it a bad idea to use the thread ID in a loop condition? Any help would be much appreciated, as I am at a loss as to why one works and the other does not.
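To back up the claim that the two loops are equivalent, here is a small host-side sketch that emulates each thread sequentially. The function names, the plain `int blockDim` parameter, and the test data are all made up for illustration. Both functions do the same floating-point operations in the same order, so on the host they should return identical values for every `tid`.

```cpp
#include <vector>

// Emulates one thread of the first version: the loop bound depends on tid.
float rowSumBoundedByTid(int tid, const std::vector<float>& L_d,
                         const std::vector<float>& L,
                         const std::vector<float>& z) {
    float acc = L_d[tid] * z[tid];
    for (int i = 0; i < tid; i++)
        acc += L[i] * z[i];
    return acc;
}

// Emulates one thread of the second version: fixed bound with an early break.
// blockDim is a plain int here, standing in for blockDim.x on the device.
float rowSumWithBreak(int tid, int blockDim, const std::vector<float>& L_d,
                      const std::vector<float>& L,
                      const std::vector<float>& z) {
    float acc = L_d[tid] * z[tid];
    for (int i = 0; i < blockDim; i++) {
        if (i == tid)
            break;
        acc += L[i] * z[i];
    }
    return acc;
}

// Compares the two loop forms for every emulated thread in a block,
// using arbitrary illustrative data.
bool loopFormsAgree(int blockDim) {
    std::vector<float> L_d(blockDim), L(blockDim), z(blockDim);
    for (int i = 0; i < blockDim; i++) {
        L_d[i] = 1.0f + i;
        L[i] = 0.5f * i;
        z[i] = 2.0f;
    }
    for (int tid = 0; tid < blockDim; tid++) {
        if (rowSumBoundedByTid(tid, L_d, L, z) !=
            rowSumWithBreak(tid, blockDim, L_d, L, z))
            return false;
    }
    return true;
}
```

Calling `loopFormsAgree(64)` covers the case of more than 32 emulated threads. This only checks the arithmetic, of course; it says nothing about how the threads actually execute on the device.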

Further info: running on Mac OS X, GeForce GT 120, compute capability 1.1, with fully updated CUDA drivers, toolkit, etc.

Thanks in advance