Can anyone tell me why the values in `weights` are all 0 after the kernel executes?

In emulation mode the values are correct, i.e., every element is 6. But when the code runs on my GeForce 9800 GT, every element comes back as 0. If I uncomment the last line of the kernel and comment out the line before it, the results on the GPU are correct as well.

Why? What is happening to the shared memory?

```
#define SIZE (4)

__global__ static void HelloCUDA(float *d_sum, float *weights)
{
    extern __shared__ float sharedMem[];
    float *w = sharedMem;
    float *sum = (float *)&w;

    w[threadIdx.x] = threadIdx.x;
    __syncthreads();

    if (threadIdx.x == 0) {
        *sum = 0.;
        for (int i = 0; i < blockDim.x; i++) {
            *sum += w[i];
        }
        *d_sum = *sum;
    }
    __syncthreads();
    __threadfence();

    weights[threadIdx.x] = *sum;
    //weights[threadIdx.x] = *d_sum;
}

// Host side: kernel launch
HelloCUDA<<<1, SIZE, sizeof(float) * SIZE>>>(device_sum, device_weights);
```
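
One thing that stands out, offered as a likely cause rather than a confirmed diagnosis: `(float *)&w` takes the address of the pointer variable `w` itself, which is private to each thread, not an address inside `sharedMem`. If that is the problem, thread 0 would be accumulating into its own private storage (and overwriting its own copy of `w` along the way), while every other thread reads back its own unrelated private value through `*sum`. Below is a minimal sketch of one possible fix, assuming the intent was to keep the sum in one extra shared-memory slot placed after the `SIZE` weights. The names `HelloCUDAFixed` and `runHelloCUDAFixed` are hypothetical and only used for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE (4)

// Hypothetical variant of the kernel: the sum lives in shared memory,
// in the slot right after the blockDim.x weights.
__global__ static void HelloCUDAFixed(float *d_sum, float *weights)
{
    extern __shared__ float sharedMem[];
    float *w = sharedMem;
    float *sum = &w[blockDim.x];   // points into shared memory, not at the pointer w

    w[threadIdx.x] = threadIdx.x;
    __syncthreads();               // all weights written before thread 0 reads them

    if (threadIdx.x == 0) {
        *sum = 0.f;
        for (unsigned int i = 0; i < blockDim.x; i++) {
            *sum += w[i];
        }
        *d_sum = *sum;
    }
    __syncthreads();               // make thread 0's shared-memory write visible to the block

    weights[threadIdx.x] = *sum;
}

// Hypothetical host helper: allocates buffers, launches one block of SIZE
// threads with (SIZE + 1) floats of dynamic shared memory, copies results back.
static cudaError_t runHelloCUDAFixed(float *h_sum, float *h_weights)
{
    float *device_sum = NULL;
    float *device_weights = NULL;
    cudaError_t err;

    err = cudaMalloc((void **)&device_sum, sizeof(float));
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void **)&device_weights, sizeof(float) * SIZE);
    if (err != cudaSuccess) { cudaFree(device_sum); return err; }

    // Note the extra float reserved for the sum.
    HelloCUDAFixed<<<1, SIZE, sizeof(float) * (SIZE + 1)>>>(device_sum, device_weights);
    err = cudaGetLastError();

    if (err == cudaSuccess)
        err = cudaMemcpy(h_sum, device_sum, sizeof(float), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_weights, device_weights, sizeof(float) * SIZE,
                         cudaMemcpyDeviceToHost);

    cudaFree(device_sum);
    cudaFree(device_weights);
    return err;
}
```

Calling `runHelloCUDAFixed(&sum, weights)` from `main` with a host `float sum` and `float weights[SIZE]` should fill both with the block-wide total (0 + 1 + 2 + 3 = 6). In this sketch `__threadfence()` is dropped because the value is read from shared memory after `__syncthreads()`; it would still matter for the variant that reads `*d_sum` from global memory.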