Hi guys,

I’ve got a strange problem. First, look at this piece of code:

```
__global__ void k_calc_cost(PTMatrix matrix, int ybase, int xbase)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int pbase = ybase + COST_BLOCK_SIZE * by;
    int qbase = xbase + COST_BLOCK_SIZE * bx;
    int pid = pbase + ty;
    int qid = qbase + tx;
    if (pid == 0 && qid == 0)
    {
        *matrix.d_max_candidate_reached_count = 0;
    }
    __shared__ float W[COST_BLOCK_SIZE][COST_BLOCK_SIZE];
    if (pid < matrix.p.number && qid < matrix.q.number)
    {
        Particle& pxyz = matrix.p.d_array[pid];
        Particle& qxyz = matrix.q.d_array[qid];
        float cost =
            (pxyz.x - qxyz.x) * (pxyz.x - qxyz.x) +
            (pxyz.y - qxyz.y) * (pxyz.y - qxyz.y) +
            (pxyz.z - qxyz.z) * (pxyz.z - qxyz.z);
        W[ty][tx] = cost;
        //printf("cost (%d, %d) = %f\n", pid, qid, cost);
    }
    .........
```

It’s a simple kernel. The problem is on this line:

W[ty][tx] = cost;

When I comment it out, the program runs fast, but when I uncomment it, it becomes very slow. Does anybody have any idea what is happening? Thanks.
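
One thing that might be worth checking (a guess, since the rest of the kernel is not shown): with the store removed, `cost` is never used, so the compiler may be free to drop the global-memory loads and the arithmetic that feed it. The "fast" version would then simply be doing much less work. Below is a minimal sketch for measuring that effect in isolation with CUDA events. The `Particle` layout, the `BLOCK` size, and the names `cost_kernel`, `time_kernel`, and `compare_timings` are all assumptions made for this illustration, not the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 16  // hypothetical stand-in for COST_BLOCK_SIZE

struct Particle { float x, y, z; };  // assumed layout, not taken from the post

// USE_RESULT = true keeps the shared-memory store and makes it observable;
// USE_RESULT = false computes `cost` but never uses it.
template <bool USE_RESULT>
__global__ void cost_kernel(const Particle* p, const Particle* q, int n, float* out)
{
    __shared__ float W[BLOCK][BLOCK];
    int tx = threadIdx.x, ty = threadIdx.y;
    int pid = blockIdx.y * BLOCK + ty;
    int qid = blockIdx.x * BLOCK + tx;
    if (pid < n && qid < n) {
        float dx = p[pid].x - q[qid].x;
        float dy = p[pid].y - q[qid].y;
        float dz = p[pid].z - q[qid].z;
        float cost = dx * dx + dy * dy + dz * dz;
        if (USE_RESULT) {
            W[ty][tx] = cost;
            out[pid * n + qid] = W[ty][tx];  // write out so the store cannot be discarded
        }
    }
}

// Times one launch (after a warm-up launch) and returns milliseconds.
template <bool USE_RESULT>
float time_kernel(const Particle* p, const Particle* q, int n, float* out)
{
    dim3 block(BLOCK, BLOCK);
    dim3 grid((n + BLOCK - 1) / BLOCK, (n + BLOCK - 1) / BLOCK);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cost_kernel<USE_RESULT><<<grid, block>>>(p, q, n, out);  // warm-up
    cudaEventRecord(start);
    cost_kernel<USE_RESULT><<<grid, block>>>(p, q, n, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Allocates zeroed inputs, times both variants, returns false on a CUDA error.
bool compare_timings(int n, float* with_ms, float* without_ms)
{
    Particle *p = nullptr, *q = nullptr;
    float* out = nullptr;
    bool ok = cudaMalloc(&p, n * sizeof(Particle)) == cudaSuccess &&
              cudaMalloc(&q, n * sizeof(Particle)) == cudaSuccess &&
              cudaMalloc(&out, (size_t)n * n * sizeof(float)) == cudaSuccess;
    if (ok) {
        cudaMemset(p, 0, n * sizeof(Particle));
        cudaMemset(q, 0, n * sizeof(Particle));
        *with_ms = time_kernel<true>(p, q, n, out);
        *without_ms = time_kernel<false>(p, q, n, out);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    cudaFree(p);
    cudaFree(q);
    cudaFree(out);
    return ok;
}

int main()
{
    float with_ms = 0.0f, without_ms = 0.0f;
    if (!compare_timings(2048, &with_ms, &without_ms)) {
        printf("CUDA error\n");
        return 1;
    }
    printf("with store:    %.3f ms\n", with_ms);
    printf("without store: %.3f ms\n", without_ms);
    return 0;
}
```

If the two timings differ a lot here as well, the slowdown is probably the cost of the loads and the computation rather than of the shared-memory write itself. Comparing the generated code for both variants (for example with `cuobjdump -sass`) would show whether the loads were actually removed.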