I actually posted this in the General Forum, but I think that was the wrong place; it seems better suited here.
Hi Guys,
I have been working on a matrix-vector multiplication routine. Due to the structure of the matrix, I can split it into its diagonal and a single off-diagonal value per column (all entries in a given column are the same apart from the diagonal entry). These are stored as 1D arrays in constant memory. I then multiply them against a vector stored in shared memory to get the full matrix-vector product: each thread tid computes x[tid] = L_d[tid] * z[tid] plus the sum of L[i] * z[i] over all i < tid. I initially wrote the following code to do this.
[codebox]static __device__ void MatrixVecMult(float *z, float *x)
{
    int tid = threadIdx.x;
    int i;

    // Diagonal contribution for this thread's row
    x[tid] = L_d[tid] * z[tid];

    // Off-diagonal contributions: one term per column below the diagonal
    for (i = 0; i < tid; i++)
        x[tid] += L[i] * z[i];
}[/codebox]
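For context, here is roughly how everything is declared and how the routine gets called (simplified: N stands in for my real problem size, and SolveKernel is just a placeholder name):
[codebox]#define N 64               // stand-in for the real size (== blockDim.x)

__constant__ float L[N];      // off-diagonal value for each column
__constant__ float L_d[N];    // diagonal entries

// MatrixVecMult as defined above

__global__ void SolveKernel(const float *z_in, float *x_out)
{
    __shared__ float z[N];
    __shared__ float x[N];
    int tid = threadIdx.x;

    z[tid] = z_in[tid];       // stage the input vector in shared memory
    __syncthreads();          // make sure z is complete before anyone reads it

    MatrixVecMult(z, x);

    x_out[tid] = x[tid];
}[/codebox]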
So z and x live in shared memory and are of size blockDim.x, while L and L_d hold the off-diagonal values and the diagonal entries respectively, sized the same way. The first routine does not work for block sizes over 32 threads, which suggests to me something to do with warps, but I am not sure. The following code performs the exact same operations, yet it works correctly and is stable.
[codebox]static __device__ void MatrixVecMult(float *z, float *x)
{
    int tid = threadIdx.x;
    int i;

    // Diagonal contribution for this thread's row
    x[tid] = L_d[tid] * z[tid];

    // Same summation, but with a uniform loop bound and an early exit
    for (i = 0; i < blockDim.x; i++) {
        if (i == tid)
            break;
        x[tid] += L[i] * z[i];
    }
}[/codebox]
As you can see, the two routines compute the same thing: thread tid accumulates L[i] * z[i] for every i < tid and then stops. Is there a technical issue I am missing here? Something to do with warps and memory access? Is it a bad idea to use the thread ID as a loop bound? Any help would be much appreciated, as I am quite at a loss as to why one version works and the other does not.
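In case the host side matters, this is roughly how I set things up and launch (trimmed down, and again with placeholder names):
[codebox]// Simplified host-side driver (error checking omitted; the h_* arrays
// are placeholders for my real data)
float h_L[N], h_Ld[N], h_z[N], h_x[N];  // filled in elsewhere
float *d_z, *d_x;

cudaMalloc((void **)&d_z, N * sizeof(float));
cudaMalloc((void **)&d_x, N * sizeof(float));

// Upload the diagonal and off-diagonal arrays to constant memory
cudaMemcpyToSymbol(L,   h_L,  N * sizeof(float));
cudaMemcpyToSymbol(L_d, h_Ld, N * sizeof(float));

cudaMemcpy(d_z, h_z, N * sizeof(float), cudaMemcpyHostToDevice);
SolveKernel<<<1, N>>>(d_z, d_x);        // one block of N threads
cudaMemcpy(h_x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost);[/codebox]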
Further info: running on Mac OS X with a GeForce GT 120 (compute capability 1.1), fully updated CUDA drivers, toolkit, etc.
Thanks in advance