Are the atomic functions broken?


I am trying to write an MD code. At each time step I sort the particles into cells (bins) with two functions: the first one initializes the cells, while the second one counts the particles and increments the counter using atomic functions.
The counter gives values which are not possible: I have 1600 particles, yet the counter indicates that a single bin contains more than that. Based on the energy I can tell that the number of particles in the bins is not the value I get from the code. Here are the two functions I use.

__global__ void zeroing()
{
  int icx=threadIdx.x,icy=blockIdx.x;
  nppc[icx][icy]=0;                 // reset the per-cell particle counter
  __threadfence();
}

__global__ void cell_sort(float2 *position,const int ncellx,const int ncelly,const float lcellx,const float lcelly)
{
  int id = threadIdx.x + blockIdx.x * tpb;
  float2 locpos=position[id];
  int icx=(int)floor((double)locpos.x/(double)lcellx);
  int icy=(int)floor((double)locpos.y/(double)lcelly);
  int loc_ipc=atomicAdd(&nppc[icx][icy],1);   // slot for this particle in its cell
  __threadfence();
  if(loc_ipc>=maxnppc) printf("Error %d %d %d\n",icx,icy,loc_ipc);
  cel_par_ind[icx][icy][loc_ipc]=id;          // store the particle index in its cell
}

Is there some bug in the atomic functions? Am I using them wrong?

I don’t see anything wrong, although I think your use of __threadfence() is unnecessary in both kernels.

I am kind of amazed to see you got a triple-nested array allocated on the device, though. :)

Thanks for the reply. It appears the problem was related to something else. I had to add a line for the positions before getting the indices:

  int icx=(int)floor((double)locpos.x/(double)lcellx);
  int icy=(int)floor((double)locpos.y/(double)lcelly);

Now it appears to work. It is strange because I was doing it anyway in the integration function.

I am using global device arrays.

__device__ int cel_par_ind[8][8][1000];

It makes coding easier in the testing stage. I do not have to keep track of the pointers, allocations and function arguments. It is a bad habit I acquired from programming Fortran.

Ah, OK, I suspected this was the only way to do this without going crazy. :)

Just a side note: your computation of icx and icy is really expensive. It would be much better to pass in 1.0/lcellx and 1.0/lcelly as double arguments from the host, and multiply your input coordinates by those. Double-precision divides are very, very slow, and the precompute would eliminate this.

The atomic add will probably be the limiting speed factor of the kernel.

You might also change your overflow error check to print the debug message then assert() to halt the kernel. If you blindly execute the next line, you may write into unknown memory, corrupting unknown things, causing unknown problems. This might even interfere with the previous printf() since it’s implemented as a kind of stored memory log that doesn’t get printed on the host until much later. At the very least, make the next write-to-memory line an “else” branch of the overflow test.


Thanks for the replies. It appears the problem was related to finding the indices icx and icy: some particles got sorted into the wrong cell, or not sorted at all. Once I switched everything to double, the problem was gone. In practice this slows my code down by about a factor of 2; the code seems to be memory bound, so whether the operations are float or double does not matter so much. It is a little disappointing — I did not expect the float division to cause so many problems, especially when there is just some sorting.
The problem comes from this kind of line:

  locpos.x=locpos.x-lx*(int)floor((double)locpos.x/(double)lx);

After all, it seems there was a numerical instability in the algorithm which, combined with the rounding errors, made some particles not be sorted.

Just in case someone finds this topic by accident: there is nothing wrong with the atomic functions, and single precision is fine for Langevin dynamics. After many hours of frustration and testing I found a mistake in my implementation of the cell list.