I’m writing an 8x8 IDCT kernel using the standard row/column 1D DCT decomposition. Each thread first computes one element of the temp matrix, which is the dot product of a row of the original matrix and a column of the DCT constants. All the threads *should* then wait for this matrix to be fully computed before using it to compute the final result, but I get garbage. My kernel is as follows:

```
__global__ static void mat_idct()
{
  __shared__ float temp[64];
  const uint16_t x = threadIdx.x, y = threadIdx.y;
  uint16_t k;
  float elem = 0.0f;
  for (k = 0; k < 8; k++)
    elem += transform[k*8+x] * block_d[y*8+k];
  temp[y*8+x] = elem;
  __syncthreads();
  elem = 0.0f;
  for (k = 0; k < 8; k++)
    elem += transform[k*8+y] * temp[k*8+x];
  block_d[y*8+x] = lrintf(elem);
}
```

transform and block_d are global 8x8 matrices.
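For comparison, here is a minimal host-side sketch of the same two passes, which could be used to check the kernel's output. It is plain C++ that can sit in the same .cu file. The function name `idct8x8_reference` and the choice of `float` for the block are assumptions for illustration; the actual element type of block_d is not shown above.

```cpp
#include <math.h>

// Hypothetical host-side reference for the two passes in mat_idct().
// Assumption: both matrices are 8x8, row-major, stored as float.
void idct8x8_reference(const float transform[64], float block[64])
{
    float temp[64];

    // Pass 1: temp[y][x] = sum over k of block[y][k] * transform[k][x]
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            float elem = 0.0f;
            for (int k = 0; k < 8; k++)
                elem += transform[k*8+x] * block[y*8+k];
            temp[y*8+x] = elem;
        }
    }

    // Pass 2 runs only after all of temp has been written. On the GPU,
    // this ordering is what __syncthreads() is expected to provide
    // for the threads of one block.
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            float elem = 0.0f;
            for (int k = 0; k < 8; k++)
                elem += transform[k*8+y] * temp[k*8+x];
            block[y*8+x] = (float)lrintf(elem);  // round to nearest, as in the kernel
        }
    }
}
```

Since the CPU version is sequential, the second pass can never see a partially filled temp, so any difference from the kernel's output points at either the synchronization or the launch setup rather than the arithmetic.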

Enforcing the barrier by splitting the row and column DCT passes into separate kernels, with the intermediate matrix kept in global memory, works perfectly, but it’s half as fast. Am I missing something about __syncthreads()? As I understand it, it should let a matrix in shared memory be computed in the first half of a kernel and used in the second half.

Attached is a simple test case; the two 8x8 matrices it prints should be identical, but they are nowhere close for me (I’m on a MacBook Pro with a GeForce 8600, Mac OS X 10.5.4).

syncthreads.cu.zip (1.27 KB)