atomicAdds within two loops

Dear everybody,

I have a little kernel and I seem to be out of my depths. Here is the kernel itself:

void potatoe_stamp(float4 * d_particles, float * d_density, uint * d_offsets, int * d_stamp_offsets, float * d_stamp, size_t stamp_vol, size_t chunk_vol, uint chunk_count, size_t particle_num, size_t pixel_num){
      //load stamp and stamp offsets into shared memory
      extern __shared__ float s_stamp[];
      int * s_stamp_offsets = (int *) &s_stamp[chunk_vol];
      uint address;
      //for all chunks
      int c = 0;
      int load;
      while (c < chunk_count){
                  //load stamp chunk to shared
                  int t = threadIdx.x;
                  load = (c == chunk_count - 1) ? stamp_vol - (chunk_count-1)*chunk_vol : chunk_vol;
                  while (t < load){
                          s_stamp_offsets[t] = d_stamp_offsets[c*chunk_vol + t];
                          s_stamp[t] = d_stamp[c*chunk_vol + t];
                          t+= blockDim.x;
                  int w = blockIdx.x;
                  while (w < particle_num){
                          t = threadIdx.x;
                          while (t < load){
                                  address = d_offsets[w] + s_stamp_offsets[t];
                                  atomicAdd(&d_density[address], d_particles[w].w*s_stamp[t]);
                                  //atomicAdd(&d_density[address], 1.0*w + 1.0*t);
                                  t += blockDim.x;
                          w += gridDim.x;

And here is what it is supposed to do:

  1. Load chunks of a “stamp”, specified by float values and memory offsets in a preset 3d pixel array (to shared)
  2. For all chunks of a stamp
    for all particles in a set
    for all pixels of the stamp
    calculate the offset of the pixel relative to the particle in a linearised global 3d density

This somewhat works but the results are slightly different in each run. With a little trick,

atomicAdd(&d_density[address], 1.0*w + 1.0*t);

I found that not all particle/pixel combinations are always realized. This makes me think I somehow have misused atomicAdd. The implementation of atomic add is the standard one:

__device__ __forceinline__ float atomicAdd(float *address, float val)
    // Doing it all as longlongs cuts one __longlong_as_double from the inner loop
    unsigned int *ptr = (unsigned int *)address;
    unsigned int old, newint, ret = *ptr;
    do {
        old = ret;
        newint = __float_as_int(__int_as_float(old)+val);
    } while((ret = atomicCAS(ptr, old, newint)) != old);

    return __int_as_float(ret);    unsigned long long old, newdbl, ret = *ptr;
    do {
        old = ret;
        newdbl = __double_as_longlong(__longlong_as_double(old)+val);
    } while((ret = atomicCAS(ptr, old, newdbl)) != old);

    return __longlong_as_double(ret);

Can anybody provide me with a hint?


That doesn’t look standard to me. Strange is the adjective that comes to mind. You may wish to study the section on atomics in the programming guide. There are several strange points, to pick just one, why should you need to provide an implementation for float atomicAdd ? Having said all that, your implementation may be working correctly. It’s not clear to me how your “little trick” allows you to distinguish between w=1, t=1 and w=0, t=2.

But the only real suggestion I have is to provide a short, complete test case. Do as you wish, of course.

I haven’t looked at the code.

Generally speaking, floating-point operations are not associative. So in a summation, when the order of operands differs, the result may differ, e.g. (a + b) + c != (b + c) + a.

At the same time, the order in which operands are incorporated into the sum when using atomic operations is indeterminate. Therefore, summation by atomic add can deliver different results between multiple runs when summing the same data. Depending on the operands involved, the result could even differ significantly.

Hello Robert,

thank you for your hint, I wasn’t ware that there was an implementation already. I got this one from the internet. The little trick gives me indeed no guarantee that the same sum hints at the same loop parameters, but it can tell me that a different run between two consecutive runs hints at a problem.

Best, Kai

That is a very interesting point. I’m not married to the floating point representation, I just used it to model a density. I wonder how i could get around it? Since integer arithmetic seems to be associative, would i just use int32 or uint32 (only multiplications and additions) by shifting a few places behind the comma in front of it and cast it as int32? (And, in the end, transform it back to float)?

You could certainly experiment with a fixed-point representation to achieve deterministic results with atomic additions (as long as there is no overflow). There are applications that actually use such a scheme, although I think they use a 64-bit fixed-point accumulator to avoid issues caused by overly tight restrictions on magnitude. So all computation would be with floats, only at the point of accumulation would there be a switch to a fixed-point format.