I am using a kernel that processes each list in parallel (32 threads per list) to perform a force calculation. Each computed force is applied to both particles of the pair (Newton's third law). When I run the kernel several times on the same input, the results differ very slightly for some particles.

Here is the structure of my kernel:

```
__global__ void fluidColloidsInteractionList(int* counters, grid_t g,
                                             glm::dvec3* fluidPos, glm::dvec3* colloidPos,
                                             fluid_data fdata, int vsize,
                                             glm::dvec3* Fsum, glm::dvec3* FFsum, int* verlet)
{
    const int id = threadIdx.x + blockIdx.x * blockDim.x;
    // Some shared values
    __shared__ glm::dvec3 sharedValue;
    __shared__ int count;
    __shared__ int counted;
    const int colloidID = id / 32;            // one list per 32 threads
    const int startpos = colloidID * vsize;   // first entry of this colloid's list
    __syncthreads();
    // Thread 0 loads the colloid position and the list length
    if (threadIdx.x == 0)
    {
        sharedValue = colloidPos[colloidID];
        count = counters[colloidID];          // number of items in the list
    }
    __syncthreads();
    const glm::dvec3 pos = sharedValue;
    const int counter = count;
    __syncthreads();
    // Reuse sharedValue as the accumulator for the colloid force
    if (threadIdx.x == 0) sharedValue = glm::dvec3(0);
    __syncthreads();
    glm::dvec3 res = glm::dvec3(0);
    // Read the list, 32 entries at a time
    for (int i = startpos + threadIdx.x; i < startpos + counter; i += 32)
    {
        const int fid = getIDinList(...);
        // Compute the interaction
        glm::dvec3 tmp;                       // result of the current step
        res += tmp;                           // add it to this thread's partial force
        // Newton's third law: apply the opposite force to the fluid particle
        atomicAdd(&FFsum[fid].x, -tmp.x);
        atomicAdd(&FFsum[fid].y, -tmp.y);
        atomicAdd(&FFsum[fid].z, -tmp.z);
    }
    // List done
    __syncthreads();
    // Sum the partial results of the 32 threads in shared memory
    atomicAdd(&sharedValue.x, res.x);
    atomicAdd(&sharedValue.y, res.y);
    atomicAdd(&sharedValue.z, res.z);
    __syncthreads();
    // Thread 0 writes the final result
    if (threadIdx.x == 0) Fsum[colloidID] += sharedValue;
}
```

The results in `Fsum` are identical on every run, while the results in `FFsum` differ from run to run for a few particles per iteration.

Any idea what is going wrong?