I am attempting to write a kernel that performs the following function on a 2D array:

For each row, divide all colums by column 0, sum the result, take the inverse of the result, and place the result in column 0 of the output. Next, do the same thing for column 1 and place the results in column 1 output. For example, if there is a matrix TxK, where K=3, then the calculation for out[0,0] would be:

out[0,0] = 1 / (in[0,0]/in[0,0] + in[0,1]/in[0,0] + in[0,2]/in[0,0])

… …

out[T,0] = 1 / (in[T,0]/in[T,0] + in[T,1]/in[T,0] + in[T,2]/in[T,0])

Currently, I have implemented the following kernel, which I loop over K times.

```
__global__ void updateUFCM(float *U, int nT, int nK) {
int kk = blockIdx.x * blockDim.x + threadIdx.x;
int tt = blockIdx.y * blockDim.y + threadIdx.y;
if (tt < nT && kk < nK) {
float sum = 0.f;
float dk = tex2D(tDX, kk, tt);
float di = 0.f;
for (int ii=0; ii < nK; ii++) {
di = tex2D(tDX, ii, tt);
sum += ( dk/di );
}
float u = 1.0f / sum;
U[tt * nK + kk] = u;
}
}
```

Columns 1 and 2 are working out fine. However, column 0 (which should be all 1s in the case of the K=0 loop) has values that are not 1 in random places.

Could there be memory problems with threads trying to access the same memory address within a block? I am planning to reimplement the code so that the grid.x dimension will be set to the number of Ks. However, this will limit the K value to 512, which would be OK for now.

Is there anything that jumps out as being a really bad idea, other than the fact that it is a naive newbie approach?

I have checked the input matrix, and it appears fine.

Thanks.