__syncthreads() not syncing

[edit] …and I just saw that there’s a different forum for programming questions. If a moderator could move this there, that’d be great.

I’m writing an 8x8 IDCT kernel using the standard row/column decomposition into 1-D DCTs. Each thread first computes one element of the temp matrix, which is the dot product of a row of the original matrix and a column of the DCT constants. All threads should then wait until the whole temp matrix has been computed before using it to compute the final result, but I get garbage. Here is my kernel:

__global__ static void mat_idct()
{
    __shared__ float temp[64];
    const uint16_t x = threadIdx.x, y = threadIdx.y;
    uint16_t k;
    float elem = 0.0f;

    // Pass 1: temp = block_d * transform, one element per thread
    for (k = 0; k < 8; k++)
        elem += transform[k*8+x] * block_d[y*8+k];
    temp[y*8+x] = elem;

    // Every thread should wait here until all of temp has been written
    __syncthreads();

    // Pass 2: result = transpose(transform) * temp, rounded to the nearest integer
    elem = 0.0f;
    for (k = 0; k < 8; k++)
        elem += transform[k*8+y] * temp[k*8+x];
    block_d[y*8+x] = lrintf(elem);
}

transform and block_d are global 8x8 matrices.

Enforcing the barrier by splitting the row and column passes into separate kernels that pass the temp matrix through global memory works perfectly, but it’s half as fast. Am I missing something about __syncthreads()? It seems like it should let a matrix in shared memory be computed in the first half of a kernel and be used in the second half.
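
To make the comparison concrete, here is a minimal sketch of the two variants described above, checked against a CPU reference. It is not the attached file. The declarations of `transform` and `block_d`, the global scratch array `temp_d`, the orthonormal DCT-II basis, the test input, and the helper `count_mismatches` are all assumptions made for illustration. It also assumes each launch uses a single block of 8x8 threads, since `__syncthreads()` only synchronizes the threads of one block.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Assumed declarations; the post only says these are global 8x8 matrices.
__device__ float transform[64];
__device__ float block_d[64];
__device__ float temp_d[64];  // hypothetical scratch for the two-kernel variant

// Fused variant: both passes in one kernel with a barrier in between.
__global__ void idct_fused()
{
    __shared__ float temp[64];
    const int x = threadIdx.x, y = threadIdx.y;
    float elem = 0.0f;
    for (int k = 0; k < 8; k++)
        elem += transform[k*8+x] * block_d[y*8+k];
    temp[y*8+x] = elem;
    __syncthreads();  // only synchronizes the threads of this one block
    elem = 0.0f;
    for (int k = 0; k < 8; k++)
        elem += transform[k*8+y] * temp[k*8+x];
    block_d[y*8+x] = lrintf(elem);
}

// Split variant, pass 1: row transform into global scratch.
__global__ void idct_rows()
{
    const int x = threadIdx.x, y = threadIdx.y;
    float elem = 0.0f;
    for (int k = 0; k < 8; k++)
        elem += transform[k*8+x] * block_d[y*8+k];
    temp_d[y*8+x] = elem;
}

// Split variant, pass 2: the kernel boundary acts as the barrier.
__global__ void idct_cols()
{
    const int x = threadIdx.x, y = threadIdx.y;
    float elem = 0.0f;
    for (int k = 0; k < 8; k++)
        elem += transform[k*8+y] * temp_d[k*8+x];
    block_d[y*8+x] = lrintf(elem);
}

// CPU reference using the same two passes.
static void idct_cpu(const float *t, const float *in, float *out)
{
    float tmp[64];
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++) {
            float e = 0.0f;
            for (int k = 0; k < 8; k++) e += t[k*8+x] * in[y*8+k];
            tmp[y*8+x] = e;
        }
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++) {
            float e = 0.0f;
            for (int k = 0; k < 8; k++) e += t[k*8+y] * tmp[k*8+x];
            out[y*8+x] = (float)lrintf(e);
        }
}

// Returns how many elements of either GPU result differ from the CPU
// reference by more than 1.0, or -1 on a CUDA error.
int count_mismatches()
{
    float t[64], in[64], ref[64], fused[64], split[64];
    for (int k = 0; k < 8; k++)      // orthonormal DCT-II basis (assumed)
        for (int n = 0; n < 8; n++)
            t[k*8+n] = (k == 0 ? sqrtf(0.125f) : 0.5f)
                     * cosf((2*n + 1) * k * 3.14159265f / 16.0f);
    for (int i = 0; i < 64; i++)     // arbitrary test input
        in[i] = (float)((i * 37) % 255 - 128);
    idct_cpu(t, in, ref);

    const dim3 threads(8, 8);        // one block, so the barrier covers all 64 threads
    cudaMemcpyToSymbol(transform, t, sizeof t);
    cudaMemcpyToSymbol(block_d, in, sizeof in);
    idct_fused<<<1, threads>>>();
    cudaMemcpyFromSymbol(fused, block_d, sizeof fused);

    cudaMemcpyToSymbol(block_d, in, sizeof in);  // reset the input for the split run
    idct_rows<<<1, threads>>>();
    idct_cols<<<1, threads>>>();
    cudaMemcpyFromSymbol(split, block_d, sizeof split);
    if (cudaDeviceSynchronize() != cudaSuccess || cudaGetLastError() != cudaSuccess)
        return -1;

    int bad = 0;
    for (int i = 0; i < 64; i++) {
        if (fabsf(fused[i] - ref[i]) > 1.0f) bad++;
        if (fabsf(split[i] - ref[i]) > 1.0f) bad++;
    }
    return bad;
}
```

Calling `count_mismatches()` from a `main` and printing the result is enough to try it. A return value of 0 means both GPU variants agree with the CPU reference to within one integer step, and a nonzero count for the fused variant alone would point at the barrier or the launch configuration rather than at the arithmetic.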

Attached is a simple test case. The two 8x8 matrices it prints should be identical, but they are nowhere close for me. (I’m on a MacBook Pro with a GeForce 8600, Mac OS X 10.5.4.)
syncthreads.cu.zip (1.27 KB)