atomicAdds within two loops

Dear everybody,

I have a little kernel and I seem to be out of my depth. Here is the kernel itself:

__global__
void potatoe_stamp(float4 * d_particles, float * d_density, uint * d_offsets,
                   int * d_stamp_offsets, float * d_stamp, size_t stamp_vol,
                   size_t chunk_vol, uint chunk_count, size_t particle_num,
                   size_t pixel_num)
{
    //load stamp and stamp offsets into shared memory
    extern __shared__ float s_stamp[];
    int * s_stamp_offsets = (int *) &s_stamp[chunk_vol];
    uint address;
    //for all chunks
    int c = 0;
    int load;
    while (c < chunk_count) {
        //load stamp chunk to shared
        int t = threadIdx.x;
        load = (c == chunk_count - 1) ? stamp_vol - (chunk_count - 1) * chunk_vol
                                      : chunk_vol;
        while (t < load) {
            s_stamp_offsets[t] = d_stamp_offsets[c * chunk_vol + t];
            s_stamp[t] = d_stamp[c * chunk_vol + t];
            t += blockDim.x;
        }
        __syncthreads();
        int w = blockIdx.x;
        while (w < particle_num) {
            t = threadIdx.x;
            while (t < load) {
                address = d_offsets[w] + s_stamp_offsets[t];
                atomicAdd(&d_density[address], d_particles[w].w * s_stamp[t]);
                //atomicAdd(&d_density[address], 1.0*w + 1.0*t);
                __syncthreads();
                t += blockDim.x;
            }
            w += gridDim.x;
        }
        c++;
    }
}

And here is what it is supposed to do:

  1. Load chunks of a “stamp” into shared memory; the stamp is specified by float values and memory offsets within a preset 3d pixel array.
  2. For each chunk of the stamp,
     for each particle in the set,
     for each pixel of the chunk:
     calculate the offset of the pixel relative to the particle in a linearised global 3d density array and atomically add the particle-weighted stamp value there.

This somewhat works, but the results are slightly different in each run. With a little trick,

atomicAdd(&d_density[address], 1.0*w + 1.0*t);

I found that not all particle/pixel combinations are always realized. This makes me think I have somehow misused atomicAdd. The atomicAdd implementation is the standard one:

__device__ __forceinline__ float atomicAdd(float *address, float val)
{
    // Doing it all as unsigned ints cuts one __int_as_float from the inner loop
    unsigned int *ptr = (unsigned int *)address;
    unsigned int old, newint, ret = *ptr;
    do {
        old = ret;
        newint = __float_as_int(__int_as_float(old) + val);
    } while ((ret = atomicCAS(ptr, old, newint)) != old);

    return __int_as_float(ret);
}

Can anybody provide me with a hint?

That doesn’t look standard to me. Strange is the adjective that comes to mind. You may wish to study the section on atomics in the programming guide. There are several strange points; to pick just one, why should you need to provide an implementation for float atomicAdd? Having said all that, your implementation may be working correctly. It’s not clear to me how your “little trick” allows you to distinguish between w=1, t=1 and w=0, t=2 (both combinations add 2.0).

But the only real suggestion I have is to provide a short, complete test case. Do as you wish, of course.

I haven’t looked at the code.

Generally speaking, floating-point operations are not associative. So in a summation, when the order of operands differs, the result may differ, e.g. (a + b) + c != (b + c) + a.

At the same time, the order in which operands are incorporated into the sum when using atomic operations is indeterminate. Therefore, summation by atomic add can deliver different results between multiple runs when summing the same data. Depending on the operands involved, the result could even differ significantly.
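
Here is a minimal host-side illustration (the constants are chosen only to make the rounding visible at float precision; any values of comparable magnitude would do):

#include <cstdio>

int main() {
    // 1.0e8f is exactly representable, and at that magnitude the spacing
    // between adjacent floats is 8.0. Adding 3.0 once is rounded away,
    // while adding 6.0 (= 3.0 + 3.0) rounds up to the next representable value.
    float a = 1.0e8f, b = 3.0f, c = 3.0f;
    printf("(a + b) + c = %.1f\n", (a + b) + c);  // prints 100000000.0
    printf("a + (b + c) = %.1f\n", a + (b + c));  // prints 100000008.0
    return 0;
}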

Hello Robert,

thank you for your hint, I wasn’t aware that there was an implementation already. I got this one from the internet. The little trick indeed gives me no guarantee that the same sum implies the same loop parameters, but a different result between two consecutive runs does hint at a problem.

Best, Kai

That is a very interesting point. I’m not married to the floating-point representation; I just used it to model a density. I wonder how I could get around it? Since integer arithmetic is associative, could I just use int32 or uint32 (there are only multiplications and additions) by scaling the values so that a few of the places behind the decimal point move in front of it, and then casting to int32? (And, in the end, transforming it back to float?)

You could certainly experiment with a fixed-point representation to achieve deterministic results with atomic additions (as long as there is no overflow). There are applications that actually use such a scheme, although I think they use a 64-bit fixed-point accumulator to avoid issues caused by overly tight restrictions on magnitude. So all computation would be with floats, only at the point of accumulation would there be a switch to a fixed-point format.
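
A minimal sketch of what such an accumulation could look like (the scale factor and the helper names are illustrative choices of mine, not part of your kernel; CUDA provides a native 64-bit unsigned atomicAdd):

// Illustrative scale: 2^20 fractional bits. Pick it so that the largest
// possible sum still fits into 64 bits.
#define FIXED_SCALE 1048576.0f

// Accumulate one float contribution into a 64-bit fixed-point cell.
__device__ void add_density_fixed(unsigned long long *cell, float val)
{
    // Round to the nearest fixed-point integer. The cast to unsigned relies
    // on two's-complement wrap-around, so negative contributions still
    // cancel correctly as long as the true sum is representable.
    long long fx = llrintf(val * FIXED_SCALE);
    // Integer addition is associative, so the accumulated value no longer
    // depends on the order in which threads arrive.
    atomicAdd(cell, (unsigned long long)fx);
}

// After the kernel has finished: convert a cell back to float.
float density_from_fixed(unsigned long long cell)
{
    return (float)(long long)cell / FIXED_SCALE;
}

With a scheme like this the per-run results are bit-identical for the same input, at the cost of a fixed absolute resolution of 1/FIXED_SCALE.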