Device memory corruption? Will this corrupt memory?

I have a large vector that every thread in my CUDA kernel needs to access in its entirety. Each thread has to update the existing values of this vector. So, my kernel has this form:

tid = threadIdx.x + blockIdx.x * blockDim.x;

for (i = 0; i < nelem; i++) {
    idx = … some index of v computed from i …
    v[idx] += f(i);
}

When I run this, the output I get in v for the first few runs, before the card locks up (I’m using Red Hat 4.0), is not the same from run to run, when it should be completely deterministic. I’ve tried variations like:

for (i = 0; i < nelem; i++) {
    idx = …
    a = v[idx];
    __syncthreads();
    v[idx] = a + f(i);
    __syncthreads();
}

with no effect.

So, my question is: when I have an array and every thread needs to pass over each of its elements, how do I set things up to prevent conflicts when concurrent threads assign values?
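
One common way to avoid this kind of conflict is to make each update atomic. __syncthreads() only synchronizes the threads of a single block, and it does not turn the read-modify-write in v[idx] += f(i) into one indivisible step, so threads can still overwrite each other's results. The sketch below is one possible arrangement, not a definitive fix. The names f, index_of, accumulate and count_mismatches are hypothetical stand-ins (the real index computation is not shown above), and it assumes that v holds float and that the device supports atomicAdd on float in global memory (compute capability 2.0 or later).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the post's f(i); returns a value that is exact in float.
__device__ float f(int i) { return 1.0f; }

// Hypothetical stand-in for the elided index computation.
// It is a permutation of 0..nelem-1 whenever nelem is not a multiple of 7.
__device__ int index_of(int i, int nelem) { return (i * 7) % nelem; }

// Every thread walks the whole vector, as in the post, but each update is atomic,
// so concurrent read-modify-write operations on the same element cannot be lost.
__global__ void accumulate(float *v, int nelem) {
    for (int i = 0; i < nelem; i++) {
        int idx = index_of(i, nelem);
        atomicAdd(&v[idx], f(i));  // assumes float atomicAdd is available
    }
}

// Runs the kernel and returns the number of elements that differ from the
// expected total (one contribution of 1.0f per thread), or -1 on a CUDA error.
int count_mismatches(int nelem, int blocks, int threads) {
    float *d_v = nullptr;
    size_t bytes = nelem * sizeof(float);
    if (cudaMalloc(&d_v, bytes) != cudaSuccess) return -1;
    cudaMemset(d_v, 0, bytes);

    accumulate<<<blocks, threads>>>(d_v, nelem);

    float *h_v = new float[nelem];
    // cudaMemcpy waits for the kernel and reports any launch or execution error.
    cudaError_t err = cudaMemcpy(h_v, d_v, bytes, cudaMemcpyDeviceToHost);

    int bad = (err == cudaSuccess) ? 0 : -1;
    if (bad == 0) {
        const float expected = (float)(blocks * threads);
        for (int k = 0; k < nelem; k++) {
            if (h_v[k] != expected) bad++;
        }
    }
    delete[] h_v;
    cudaFree(d_v);
    return bad;
}

int main() {
    int bad = count_mismatches(1024, 4, 64);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Atomic updates to the same element are serialized, so this can be slow when many threads hit the same location. If the work can be rearranged so that each thread owns distinct elements of v, no atomics are needed at all. Also, with a general f the order of floating-point additions can still vary between runs, so the last bits of the result may differ even though no update is lost.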

Any help appreciated!

Ron