I am trying to multiply two arrays, x and y, each of length 150, with a matrix of width 1313 and height 300. After multiplying the rows of the matrix by x and y, I need to sum each column, which gives a 1313-by-1 result.
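
Written out, the computation I am after looks roughly like the following. This assumes, based on the indexing in my kernel, that matrix rows 2i and 2i+1 pair with x[i] and y[i]; the symbols M (the matrix) and out (the result) are just names chosen for illustration.

```latex
\mathrm{out}_c \;=\; \sum_{i=0}^{149} \bigl( M_{2i,\,c}\, x_i \;+\; M_{2i+1,\,c}\, y_i \bigr),
\qquad c = 0, \dots, 1312
```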

After going through the reduction and matrix examples in the SDK, I tried this:

```
// Launch: num_blocks = 83, threads = 16, 83*16 = 1328
recon_reduce<<<num_blocks, threads>>>(mtx_imd, xsh_d, ysh_d, fac_d);

__global__ void recon_reduce(float* im, float* x, float* y, float* out)
{
    int tid = threadIdx.x + blockIdx.x*els_per_block; // els_per_block = 150
    float val;
    const int iters = 150;
    const int nact = 1313;
    #pragma unroll
    for(int i = 0; i < iters; i++)
    {
        val += (im[2*nact*(tid+i)] * x[i]) + (im[(2*(tid+i)+1)*nact] * y[i]);
    }
    if(threadIdx.x == 0)
    {
        out[blockIdx.x] = val;
    }
}
```
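
For comparison, here is a minimal sketch of a one-thread-per-column layout. It assumes a row-major matrix in which rows 2i and 2i+1 pair with x[i] and y[i]. The names `column_sums` and `column_sums_host` and their parameters are hypothetical, and CUDA error checking is left out for brevity. Two things differ from the kernel above: the accumulator starts at zero, and every index into the matrix stays inside the 300*1313 elements. In the kernel above, `tid + i` can go past the 150 row pairs (already in block 0, since threadIdx.x + 149 can reach 164), which may be related to the out-of-range exception.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Assumed layout: row-major matrix with n_cols columns; rows 2*i and 2*i+1
// are weighted by x[i] and y[i] respectively. One thread per output column.
__global__ void column_sums(const float* im, const float* x, const float* y,
                            float* out, int n_pairs, int n_cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n_cols) return;  // e.g. 83*16 = 1328 threads for 1313 columns

    float val = 0.0f;  // accumulator must be initialised
    for (int i = 0; i < n_pairs; ++i) {
        val += im[(2 * i) * n_cols + col] * x[i]
             + im[(2 * i + 1) * n_cols + col] * y[i];
    }
    out[col] = val;
}

// Hypothetical host helper: copies inputs to the device, launches the
// kernel, and returns the per-column sums. No CUDA error checking.
std::vector<float> column_sums_host(const std::vector<float>& im,
                                    const std::vector<float>& x,
                                    const std::vector<float>& y, int n_cols)
{
    const int n_pairs = static_cast<int>(x.size());
    assert(y.size() == x.size());
    assert(im.size() == 2 * x.size() * static_cast<size_t>(n_cols));

    float *im_d, *x_d, *y_d, *out_d;
    cudaMalloc(&im_d, im.size() * sizeof(float));
    cudaMalloc(&x_d, x.size() * sizeof(float));
    cudaMalloc(&y_d, y.size() * sizeof(float));
    cudaMalloc(&out_d, n_cols * sizeof(float));
    cudaMemcpy(im_d, im.data(), im.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(x_d, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y.data(), y.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 16;
    const int blocks = (n_cols + threads - 1) / threads;  // 83 for 1313 columns
    column_sums<<<blocks, threads>>>(im_d, x_d, y_d, out_d, n_pairs, n_cols);

    std::vector<float> out(n_cols);
    cudaMemcpy(out.data(), out_d, n_cols * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(im_d); cudaFree(x_d); cudaFree(y_d); cudaFree(out_d);
    return out;
}
```

With this layout, adjacent threads read adjacent columns of the same row, so each thread needs only its own register accumulator and no intermediate arrays.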

Two questions:

- When I debug the kernel, the variable val has the correct summed value of all the columns in the matrix. But when execution reaches the line "out[blockIdx.x] = val;", I see the following error:

```
warning: single stepping a divergent thread.
Single stepping the warp until the divergent thread becomes active. If the thread never becomes active, this operation may not complete.
Program received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
```

- Currently, I am using only 1 thread per block to do the multiply-add (the profiler shows that this takes 225 us). If I use all 16 threads for the task, do I need to instantiate 16 arrays to store the intermediate summed values?
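
To make the second question concrete, here is a minimal sketch of what using all 16 threads might look like, assuming one block per output column and the same row layout (rows 2i and 2i+1 pair with x[i] and y[i], row-major storage). The names `column_sums_reduce`, `reduce_columns_host`, and `THREADS` are hypothetical, and CUDA error checking is omitted. In this sketch a single shared array with one slot per thread holds the intermediate sums, rather than 16 separate arrays.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int THREADS = 16;  // must be a power of two for the tree reduction

// One block per column; the block's threads split the row pairs between them.
__global__ void column_sums_reduce(const float* im, const float* x,
                                   const float* y, float* out,
                                   int n_pairs, int n_cols)
{
    __shared__ float partial[THREADS];  // one partial sum per thread
    const int col = blockIdx.x;

    // Each thread accumulates a strided subset of the row pairs.
    float val = 0.0f;
    for (int i = threadIdx.x; i < n_pairs; i += blockDim.x) {
        val += im[(2 * i) * n_cols + col] * x[i]
             + im[(2 * i + 1) * n_cols + col] * y[i];
    }
    partial[threadIdx.x] = val;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) out[col] = partial[0];
}

// Hypothetical host helper; no CUDA error checking.
std::vector<float> reduce_columns_host(const std::vector<float>& im,
                                       const std::vector<float>& x,
                                       const std::vector<float>& y, int n_cols)
{
    const int n_pairs = static_cast<int>(x.size());
    assert(y.size() == x.size());
    assert(im.size() == 2 * x.size() * static_cast<size_t>(n_cols));

    float *im_d, *x_d, *y_d, *out_d;
    cudaMalloc(&im_d, im.size() * sizeof(float));
    cudaMalloc(&x_d, x.size() * sizeof(float));
    cudaMalloc(&y_d, y.size() * sizeof(float));
    cudaMalloc(&out_d, n_cols * sizeof(float));
    cudaMemcpy(im_d, im.data(), im.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(x_d, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y.data(), y.size() * sizeof(float), cudaMemcpyHostToDevice);

    column_sums_reduce<<<n_cols, THREADS>>>(im_d, x_d, y_d, out_d, n_pairs, n_cols);

    std::vector<float> out(n_cols);
    cudaMemcpy(out.data(), out_d, n_cols * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(im_d); cudaFree(x_d); cudaFree(y_d); cudaFree(out_d);
    return out;
}
```

One trade-off in this layout is that threads in a block read the same column from different rows, so their addresses are n_cols floats apart rather than adjacent. Whether the extra parallelism outweighs that for a matrix this small is something I would expect to have to measure.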

Any suggestions? Thanks in advance.