Fast manipulation of large, complex numbers.

In the course of implementing an algorithm, I find myself having to work with complex numbers whose real and imaginary parts are well in excess of DBL_MAX. Unfortunately, I wasn't satisfied with the performance of the algorithm on a CPU; after a couple of friends with some programming experience looked at my code, they said I wasn't doing anything obviously wrong. The problem is embarrassingly parallel (parallel in 2^(lattice width), to be exact).

I don’t need high precision. Four digits of accuracy is fine, but anything less could be an issue.

Notably, extended-precision arithmetic (the 80-bit long double, with its much wider exponent range) would make this a non-problem for me, since I am computing Z, and 2^1024 < Z < 2^16384.

I have been hoping to use CUDA, but this has me stymied. The particular problem is that I really have no idea what to expect, performance-wise, from one approximation technique versus another!

I considered storing the logarithm of the relevant numbers; this works well enough for real numbers, but for complex numbers there is a singularity, hit in about 0.01% of cases, where all the ordinary approximations fail and a more expensive calculation is required. I don't know whether anything like branch prediction exists on CUDA, though, so I am not sure whether this case can be handled cheaply. If a rarely-taken branch costs little, my problems probably disappear.
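For concreteness, the representation I have in mind looks roughly like this (just a sketch to show where the trouble is; the names are made up and this is not my actual code):

```cuda
#include <math.h>

// z is stored as exp(lg + i*arg); lg = log|z| stays small even when |z| >> DBL_MAX
struct LogComplex {
    double lg;   // log of the magnitude
    double arg;  // phase in radians
};

// multiplication is easy: add logs, add phases
__host__ __device__ LogComplex lcMul(LogComplex a, LogComplex b) {
    LogComplex r = { a.lg + b.lg, a.arg + b.arg };
    return r;
}

// addition is where the singularity lives
__host__ __device__ LogComplex lcAdd(LogComplex a, LogComplex b) {
    if (b.lg > a.lg) { LogComplex t = a; a = b; b = t; }   // make |a| >= |b|
    // a + b = a * (1 + b/a), with |b/a| <= 1 so the exp() below cannot overflow
    double m  = exp(b.lg - a.lg);
    double dp = b.arg - a.arg;
    double x  = 1.0 + m * cos(dp);   // Re(1 + b/a)
    double y  =       m * sin(dp);   // Im(1 + b/a)
    LogComplex r;
    r.arg = a.arg + atan2(y, x);
    // trouble spot: when a and b nearly cancel, x*x + y*y ~ 0 and this log
    // loses all accuracy (or returns -inf); that is the ~0.01% of cases
    // needing a more careful path
    r.lg  = a.lg + 0.5 * log(x * x + y * y);
    return r;
}
```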

In this case the CUDA implementations of log() and exp() can help if they are fast, though I would need to call them several times in the innermost loop of the program, which executes (number of order parameters) * (lattice size) * 2^(lattice width) times, i.e. at least ten million times. The algorithm itself has already undergone extensive tuning.

I have also thought of building my own floats with an integer exponent and mantissa. I guess this can work, though it’d probably be rather slow!
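Something along these lines is what I had in mind (again only a sketch, with made-up names):

```cuda
#include <math.h>

// keep a double mantissa in [0.5, 1) plus a wide integer power-of-two exponent
struct BigExpDouble {
    double    frac;  // normalized to [0.5, 1), or 0
    long long e2;    // exponent, far wider than the 11 bits inside a double
};

__host__ __device__ BigExpDouble bedNormalize(double x, long long e2) {
    int k;
    double f = frexp(x, &k);          // x = f * 2^k with f in [0.5, 1)
    BigExpDouble r = { f, e2 + k };
    if (f == 0.0) r.e2 = 0;
    return r;
}

__host__ __device__ BigExpDouble bedMul(BigExpDouble a, BigExpDouble b) {
    return bedNormalize(a.frac * b.frac, a.e2 + b.e2);
}

__host__ __device__ BigExpDouble bedAdd(BigExpDouble a, BigExpDouble b) {
    if (a.e2 < b.e2) { BigExpDouble t = a; a = b; b = t; }  // make a the larger one
    long long d = a.e2 - b.e2;
    // if b is more than ~60 binary orders of magnitude smaller, it cannot
    // affect the 53-bit mantissa of a at all
    if (d > 60) return a;
    return bedNormalize(a.frac + ldexp(b.frac, -(int)d), a.e2);
}
```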

More generally, though, I feel like someone has probably had this problem before. I am aware of the techniques used in e.g. GMP, but I figure they can't be of much use to me, since I just don't need high precision. I am really hoping there is some commonly accepted technique for dealing with combinatorially huge numbers.

Knowing almost nothing about your CPU code, I would expect it to port one-to-one to CUDA as long as it restricts itself to the 'double' type. The type 'long double' is not supported in CUDA, so the exponent range is limited to what an IEEE-754 double provides. Maybe some scaling would do the trick? Note that CUDA offers only rudimentary support for complex arithmetic: cuComplex.h provides just the basic operations needed by other CUDA libraries (e.g. CUBLAS, CUSPARSE, CUFFT).
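To illustrate the last point, this is roughly the extent of what cuComplex.h provides for double precision (add, subtract, multiply, divide, conjugate, absolute value); anything beyond that you would write yourself:

```cuda
#include <cuComplex.h>
#include <stdio.h>

int main(void) {
    // the cuComplex.h functions are usable in both host and device code
    cuDoubleComplex a = make_cuDoubleComplex(1.0, 2.0);
    cuDoubleComplex b = make_cuDoubleComplex(-3.0, 0.5);
    cuDoubleComplex s = cuCadd(a, b);    // complex addition
    cuDoubleComplex p = cuCmul(a, b);    // complex multiplication
    printf("sum  = %f + %f*i\n", cuCreal(s), cuCimag(s));
    printf("prod = %f + %f*i\n", cuCreal(p), cuCimag(p));
    printf("|a|  = %f\n", cuCabs(a));    // magnitude
    return 0;
}
```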

The requirement for extreme exponent range strikes me as unusual; can you reveal what the use case is? Most modern platforms, CPUs and GPUs alike, offer hardware support only for IEEE-754 single and double precision, and fall back to software emulation for floating-point types requiring an extended exponent range or extended accuracy, so I would expect that some workarounds have been developed for this use case.

I am a quantum computing researcher, or, more accurately, a grad student. My current project involves calculating a phenomenological quantum error correction threshold for a topological color code related to the 3-body ±J Ising model on a kisrhombille tiling. I do this with a transfer-matrix method, optimized around sparse matrix multiplication, that calculates order parameters, namely the second-moment correlation lengths, in a manner similar to standard transfer-matrix methods for calculating the partition function. The partition function is calculated as well, since it is needed to perform the averaging in the function S(k) from Palassini and Caracciolo.

In particular, the magnitude of the partition function tends to be close to exp(-(ground state energy)/(natural temperature)). The ground state energy I'm using is about -800; it depends on the lattice size, which matters because large lattices allow much more accurate simulations. The natural temperature should range over 0.5 to 3 at least, though I would prefer to bring it down to 0.3. Note that as the temperature approaches zero the partition function behaves erratically; this is the third law of thermodynamics, that absolute zero is unphysical.
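To put rough numbers on it: with a ground state energy of about -800 and a temperature of 0.3, the magnitude is roughly exp(800/0.3) ≈ exp(2667) ≈ 10^1158, while DBL_MAX is only about 1.8*10^308.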

But I am guessing lookup tables work okay on CUDA platforms? That may be enough.

Thanks for the detailed information. Unfortunately I have zero domain knowledge in this area, but the information provided may lead to advice from CUDA users familiar with this area of computational physics.

I am not sure how the question of lookup tables vs. on-the-fly calculation arises, because it seemed from your original question that your primary issue is one of data representation: the required value range under straightforward computation exceeds the dynamic range of double-precision data. Could you perform most of your computation in log space?

While lookup-table based computation can be appropriate on GPUs, I will point out that the FLOPS on GPUs keep increasing faster than memory bandwidth, so that with the latest GPUs direct computation is often more attractive. If the effort is not unreasonable, you could simply try the computation both ways to see which one is faster. Is the function S(k) you mention defined in this paper:

Palassini, M., and Caracciolo, S. 1999. Universal finite-size scaling functions in the 3D Ising spin glass. Physical Review Letters 82, 5128–5131.

Renormalize to the ground state energy, and the partition function is of order 1. :)
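Written out, that is just Z = sum_s exp(-beta*E_s) = exp(-beta*E_0) * sum_s exp(-beta*(E_s - E_0)); with E_0 the ground state energy, the largest term of the remaining sum is exactly 1.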

You will still face the problem, though, that you are summing vast numbers of terms with hugely varying orders of magnitude. To deal with this, you need to (roughly) sort the numbers by increasing magnitude, i.e. run the summation from highest to lowest energies. If you do this, you can renormalize to different energies on the fly: whenever your exponents become too large, you switch to a different energy scale.

Now, sorting and summing in order seem to contradict parallel execution of the sum. However, it is enough to reorder the states into groups of roughly the same energy, so that exp(-beta*energy) stays well representable within the mantissa of the floating-point number. E.g., since, as you said, you only need four digits of precision, with double-precision numbers (16 significant digits) you can have groups within which exp(-beta*energy) still varies by 10 to 12 orders of magnitude.
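In code, the bookkeeping could look roughly like this (a generic sketch with made-up names, not adapted to your transfer-matrix code):

```cuda
#include <math.h>
#include <stdio.h>

// each group of similar-energy states keeps an ordinary double sum together
// with the reference energy it was renormalized to: value = sum * exp(-beta*eRef)
struct ScaledSum {
    double sum;   // sum of exp(-beta*(E - eRef)) over the group's states
    double eRef;  // reference (e.g. lowest) energy of the group
};

// fold one state into a group's partial sum; E - eRef stays small by construction
__host__ __device__ void addState(ScaledSum *g, double beta, double E) {
    g->sum += exp(-beta * (E - g->eRef));
}

// combine two partial sums, renormalizing on the fly: rescale the group with
// the higher reference energy, so the rescaling factor is always <= 1
__host__ __device__ ScaledSum combine(ScaledSum a, ScaledSum b, double beta) {
    if (a.eRef > b.eRef) { ScaledSum t = a; a = b; b = t; }  // a has the lower energy
    ScaledSum r;
    r.eRef = a.eRef;
    r.sum  = a.sum + b.sum * exp(-beta * (b.eRef - a.eRef));
    return r;
}

int main(void) {
    double beta = 1.0 / 0.3;
    ScaledSum g1 = { 0.0, -800.0 };   // group around the ground state
    ScaledSum g2 = { 0.0, -750.0 };   // group of excited states
    addState(&g1, beta, -800.0);
    addState(&g1, beta, -799.5);
    addState(&g2, beta, -750.0);
    ScaledSum z = combine(g1, g2, beta);
    // Z = z.sum * exp(-beta * z.eRef) would overflow, so report it in log space
    printf("log(Z) = %f\n", log(z.sum) - beta * z.eRef);
    return 0;
}
```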

Would a compensated sum (Kahan summation) work in lieu of ordering the summands by magnitude? Or is the dynamic range of the operands in each sum such that this would only provide an incremental improvement to the accuracy of the sum?

The thread opener is looking to sum numbers that vary over 800/0.3/ln(10) ≈ 1160 orders of magnitude.

Then again, since all addends are positive, the Kahan summation would only have to cover the orders of magnitude of the number of states summed over (plus the desired number of accurate digits of the end result). So Kahan summation could indeed be a viable option.
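For reference, the compensated update per addend is only a few operations (a minimal sketch, with a toy example rather than your actual sum):

```cuda
#include <stdio.h>

// standard Kahan update; each thread or partial sum keeps its own (sum, c) pair.
// note: value-unsafe compiler optimizations (reassociation) can defeat the
// compensation term, so avoid fast-math style flags for this part.
__host__ __device__ void kahanAdd(double *sum, double *c, double x) {
    double y = x - *c;     // compensate for what was lost previously
    double t = *sum + y;   // low-order bits of y may be lost here...
    *c = (t - *sum) - y;   // ...and are recovered into the compensation term
    *sum = t;
}

int main(void) {
    double sum = 0.0, c = 0.0;
    // toy example: one huge addend plus a million tiny ones; a naive loop
    // would drop every single 1.0, Kahan keeps their total
    kahanAdd(&sum, &c, 1e16);
    for (int i = 0; i < 1000000; i++) kahanAdd(&sum, &c, 1.0);
    printf("kahan sum = %.1f\n", sum);
    return 0;
}
```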

Thanks Norbert for the suggestion!