Based on the example gurifisu gave, they would like to base the grid and/or threadblock configuration of a subsequent kernel on data computed (and stored in device memory) by a previous kernel. That is currently doable only indirectly: either by copying the relevant data back to the host, explicitly or via the zero-copy mechanism; or by having unneeded threadblocks terminate themselves based on the data stored on the device, as I had suggested.
Could you elaborate about the context and the requirements? In particular, is the return value of the atomic operation needed or is it just used for accumulation? How many independent double-precision accumulators do you need? Are the atomic operations intermixed with non-atomic accesses? Do you have an idea about the dynamic range of the values?
I am thinking about an implementation based on a fixed-point long accumulator.
Yes. Currently we are implementing a particle code for simulating plasma physics, namely the particle-in-cell (PIC) method. In its simplest form, a Poisson equation is solved for the electric potential in which the charged particles move. The source term of the Poisson equation is obtained by accumulating the charges of the particles. Previous GPU implementations have shown that such accumulations are exceedingly expensive, making them the bottleneck of the whole simulation. As double precision is the norm in scientific simulations, an efficient atomicAdd for double-precision floating-point numbers in shared memory is highly desirable.
With one million particles, each depositing its charge every time step, a simulation of one thousand steps already performs on the order of 10^9 atomic operations.
Normally in PIC simulations they are not intermixed with non-atomic accesses (I am not completely sure what you mean by this... could you explain it a little or give me an example? Thanks.)
The dynamic range of the values, for charge accumulation as an example, goes up to roughly the number of particles. In the case of current (velocity) accumulation, it varies between the negative and positive speed of light.
Thanks for asking. Please let me know if you have further questions.
Thanks for the explanation.
What I had in mind is that you might be better served by a fixed-point accumulator than by atomic floating-point operations.
An issue with floating-point atomics is that they make the computation non-deterministic: the order in which the additions complete varies between runs, and floating-point addition is not associative. This may be a concern especially if the accumulated terms can have different signs and may cancel each other.
An alternative is to perform the accumulation in a large fixed-point number (scaled integer). As fixed-point addition is associative and has no rounding error, fixed-point accumulation is deterministic and more amenable to error analysis. Using a sufficiently large accumulator, summations can even be computed exactly.
Even a 64-bit fixed-point accumulator may deliver more accurate results than double-precision floating-point (which has 53 bits of precision). For instance, using a scaling factor of 2^32, values up to 10^9 can be accumulated with a resolution of 10^(-9).
The issues with long accumulators are the following:
Converting a long accumulator back to floating-point is a relatively expensive operation. Frequent conversions will hurt performance (this is what I meant by “intermixing with non-atomic accesses”).
They may use more memory than floating-point accumulators. If you are limited by the amount of shared memory and need accumulators wider than 64 bits, this can be a problem.
Accumulation can be implemented to run in constant time regardless of the accumulator width, using a few tricks such as a high-radix carry-save representation, but this is probably not worth the effort for accumulators smaller than 128 bits.
Would 64-bit fixed-point be enough for your application? If so, the implementation is much easier, and the issues mentioned above are not a concern.