Atomic errors?

I am using a kernel that processes a list in parallel (32 threads per list) to perform a force calculation. Each calculated force is applied to 2 particles (Newton's third law). When I run my kernel several times, there are very slight differences in the results on some particles.

Here is the structure of my kernel:

__global__ void fluidColloidsInteractionList(int* counters, grid_t g, glm::dvec3* fluidPos, glm::dvec3* colloidPos, fluid_data fdata, int vsize, glm::dvec3* Fsum, glm::dvec3* FFsum, int* verlet)
{
    const int id = threadIdx.x + blockIdx.x * blockDim.x;
    // Some shared values
    __shared__ glm::dvec3 sharedValue;
    __shared__ int count;
    __shared__ int counted;
    const int colloidID = id / 32;
    const int startpos = colloidID * vsize;
    __syncthreads();
    // init/load values with thread 0
    if (threadIdx.x == 0)
    {
        sharedValue = colloidPos[colloidID];
        count = counters[colloidID]; // get the number of items in the list
    }
    __syncthreads();
    const glm::dvec3 pos = sharedValue;
    const int counter = count;
    __syncthreads();
    if (threadIdx.x == 0) sharedValue = glm::dvec3(0);
    __syncthreads();
    glm::dvec3 res = glm::dvec3(0);
    // read the list in parallel
    for (int i = startpos + threadIdx.x; i < startpos + counter; i += 32)
    {
        const int fid = getIDinList(...);
        // compute interaction
        glm::dvec3 tmp; // result of the current step
        res += tmp; // add it to the global force
        // use Newton's third law to apply -force
        atomicAdd(&FFsum[fid].x, -tmp.x);
        atomicAdd(&FFsum[fid].y, -tmp.y);
        atomicAdd(&FFsum[fid].z, -tmp.z);
    }
    // list done
    __syncthreads();
    // sum the results of the 32 threads in shared memory
    atomicAdd(&sharedValue.x, res.x);
    atomicAdd(&sharedValue.y, res.y);
    atomicAdd(&sharedValue.z, res.z);
    __syncthreads();
    // thread 0 writes the final result
    if (threadIdx.x == 0) Fsum[colloidID] += sharedValue;
}

The results in Fsum are always the same, while FFsum differs from run to run on a few particles per iteration.

Any idea what is going wrong?

The order of atomic operations when executed across a warp (and possibly in other scenarios) is not specified by CUDA. Since the order of operations can affect floating-point results, this is one possible source of variability. For this kind of variability to actually show up, the actual data and the actual order matter, so you cannot rule the hypothesis in or out with a simplistic comparison against another arithmetic sequence that doesn't seem to show the problem.
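
To illustrate the point, here is a minimal, self-contained sketch (not taken from the thread) that sums the same array once serially and once with atomicAdd across threads. Because floating-point addition is not associative and the order of the atomics is unspecified, the two results typically differ in the last bits, and the atomic result can change between runs. It assumes compute capability 6.0 or newer so that atomicAdd on double is available, which the kernel above already relies on.

// illustrative_atomic_order.cu : sum the same values serially and with atomicAdd
#include <cstdio>
#include <cuda_runtime.h>

__global__ void atomicSum(const double* vals, int n, double* out)
{
    const int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        atomicAdd(out, vals[i]); // the order of these additions across threads is unspecified
}

int main()
{
    const int n = 1 << 20;
    double* vals = nullptr;
    double* out  = nullptr;
    cudaMallocManaged(&vals, n * sizeof(double));
    cudaMallocManaged(&out, sizeof(double));

    for (int i = 0; i < n; ++i) vals[i] = 1.0 / (i + 1); // values that round differently depending on summation order

    double serial = 0.0; // reference sum in one fixed order
    for (int i = 0; i < n; ++i) serial += vals[i];

    *out = 0.0;
    atomicSum<<<(n + 255) / 256, 256>>>(vals, n, out);
    cudaDeviceSynchronize();

    // The two results typically differ in the last bits, and the atomic
    // result itself can change from run to run.
    printf("serial: %.17g\natomic: %.17g\n", serial, *out);

    cudaFree(vals);
    cudaFree(out);
    return 0;
}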

Yes, that is indeed the cause: the order of the summation differs between runs for the FFsum computation.
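
For what it's worth, a common way to make the per-warp sum order-independent (the Fsum path, which uses atomicAdd into shared memory in the kernel above) is a warp shuffle reduction, which always adds in the same order and needs no shared memory. This is only an illustrative sketch, not a fix proposed in the thread, and it does not by itself make the scattered atomicAdds into FFsum deterministic.

// A possible warp-shuffle reduction for the per-colloid sum (Fsum path only)
#include <glm/glm.hpp> // glm::dvec3, as in the kernel above

__device__ glm::dvec3 warpReduceSum(glm::dvec3 v)
{
    // Tree reduction over the 32 lanes of the warp; after the loop, lane 0
    // holds the complete sum, computed in the same order on every run.
    for (int offset = 16; offset > 0; offset >>= 1)
    {
        v.x += __shfl_down_sync(0xffffffffu, v.x, offset);
        v.y += __shfl_down_sync(0xffffffffu, v.y, offset);
        v.z += __shfl_down_sync(0xffffffffu, v.z, offset);
    }
    return v;
}

// Possible usage inside the kernel, replacing the atomicAdds into sharedValue:
//     res = warpReduceSum(res);
//     if (threadIdx.x == 0) Fsum[colloidID] += res;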
