The sign of zero with CUDA's float2

Consider the following code, which squares a complex number stored in a CUDA float2; the zero-valued component is initialized to +0.f:

float real_part = log(3.f);
float imag_part = 0.f;

float2 complex_number3a, complex_number3b;
complex_number3a.x=imag_part;
complex_number3a.y=real_part;
complex_number3b.x=complex_number3a.x*complex_number3a.x-complex_number3a.y*complex_number3a.y;
complex_number3b.y=complex_number3a.x*complex_number3a.y+complex_number3a.y*complex_number3a.x;

The result will be

complex_number3b.x= -1.20695
complex_number3b.y= 0
angle= 3.14159

where angle is the phase of the complex number and, in this case, is pi.

Now consider the same operations, but with the zero-valued component negated (note the -imag_part, which stores -0.f):

complex_number3a.x=-imag_part;
complex_number3a.y=real_part;
complex_number3b.x=complex_number3a.x*complex_number3a.x-complex_number3a.y*complex_number3a.y;
complex_number3b.y=complex_number3a.x*complex_number3a.y+complex_number3a.y*complex_number3a.x;

The result will be

complex_number3b.x= -1.20695
complex_number3b.y= -0
angle= -3.14159

The imaginary part of the result is -0, which makes the phase of the result -pi.
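
The phase is presumably obtained with atan2f(imag, real) (the posts above do not show that step, so this is an assumption); atan2f() honors the sign of a zero imaginary part when the real part is negative, which is exactly what produces +pi versus -pi. A minimal host-side check:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float re = -1.20695f;
    printf("%f\n", atan2f(+0.f, re));   /* prints  3.141593 */
    printf("%f\n", atan2f(-0.f, re));   /* prints -3.141593 */
    return 0;
}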

Although this is still consistent with the principal argument of a complex number and with the signedness of floating-point zero, the change is a problem when one defines functions of complex numbers. For example, if one defines the sqrt of a float2 number by de Moivre's formula, it flips the sign of the imaginary part of the result to a wrong value.
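
To see the effect concretely, write z = r·e^(i·theta) and apply de Moivre: sqrt(z) = sqrt(r)·e^(i·theta/2). For the value above, r = (log 3)^2 ≈ 1.20695 and sqrt(r) = log 3 ≈ 1.0986. With theta = +pi the square root is approximately +1.0986i, while with theta = -pi it is approximately -1.0986i: the sign of the zero imaginary part flips the sign of the entire imaginary part of the square root.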

How can one deal with this effect?

I am not exactly sure what you are asking. In general, CUDA should handle signed zeros as prescribed by the IEEE-754 (2008) standard. There were a few issues with incorrect constant propagation by the compiler in the past that affected signed zeros, but those issues were resolved years ago. The signbit() function can be used to check the sign bit, and the copysign() function can be used to manipulate the sign bit.
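
For instance, a helper along these lines (the function name is hypothetical) uses signbit() to detect the sign of a zero, which ordinary comparisons cannot (-0.f == +0.f compares equal), and copysignf() to stamp that sign onto a computed magnitude:

#include <math.h>

__host__ __device__ float imag_part_on_branch_cut(float imag_magnitude, float original_imag)
{
    if (signbit(original_imag))   // true for -0.f as well as for any negative value
        return -fabsf(imag_magnitude);
    return fabsf(imag_magnitude);
    // equivalently, in one step: copysignf(fabsf(imag_magnitude), original_imag)
}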

If the question is “How do I correctly implement branch cuts for complex functions using IEEE-754 signed zero”, whether in CUDA or elsewhere, the standard reference is:

W. Kahan, “Branch Cuts for Complex Elementary Functions; or, Much Ado About Nothing’s Sign Bit”. In: A. Iserles and M. Powell (eds.), The State of the Art in Numerical Analysis (1987), pp. 165-211. A copy of this paper can easily be found online using a search engine.

If the question is “Why does CUDA sometimes deliver a -0 result where other platforms deliver a result of +0”, the answer is likely “because CUDA uses FMA contraction by default”. As Kahan points out in one of his papers (“Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic”), expressions of the form ab+/-cd can quite easily lead to unexpected behavior when they are evaluated using FMA. An example he gives is a sqrt whose argument is unexpectedly negative leading to a NaN result. Muller et al give a specific numeric example for this scenario in their “Handbook of Floating-Point Arithmetic”. In such cases one can suppress FMA contraction in CUDA locally by using intrinsics, e.g. __fadd_rn(), __fmul_rn(), instead of the normal operators.
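
A minimal sketch of that local suppression (the helper name is made up): __fmul_rn() and __fadd_rn() are never contracted into an FMA by the compiler, so a*b - c*d is evaluated with two separately rounded products, matching platforms that do not contract.

__device__ float diff_of_products_no_fma(float a, float b, float c, float d)
{
    return __fadd_rn(__fmul_rn(a, b), -__fmul_rn(c, d));
}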

First of all, thank you very much for your answer.

I realize that my original post generated some confusion (these are the perils of posting late-night questions), so I have now edited it. In particular, although this is a general floating-point question, I have focused entirely on CUDA’s float2 instead of mixing float and float2.

Anyway, you correctly guessed that my question was: How do I correctly implement branch cuts for complex functions using IEEE-754 signed zero? The Kahan reference you posted helped me better understand the problem.

I was trying to define elementary transcendental functions of a complex variable, following the approach in https://github.com/jtravs/cuda_complex. For the square root, it uses

template<class Type>
__host__ __device__
complex<Type>
sqrt(const complex<Type>& __x)
{
    if (isinf(__x.imag()))
        return complex<Type>(Type(INFINITY), __x.imag());
    if (isinf(__x.real()))
    {
        if (__x.real() > Type(0))
            return complex<Type>(__x.real(), isnan(__x.imag()) ? __x.imag() : copysign(Type(0), __x.imag()));
        return complex<Type>(isnan(__x.imag()) ? __x.imag() : Type(0), copysign(__x.real(), __x.imag()));
    }
    return polar(sqrt(abs(__x)), arg(__x) / Type(2));
}

where polar is defined as

template<class Type>
__host__ __device__ complex<Type> polar(const Type& rho, const Type& theta = Type(0))
{
    if (isnan(rho) || signbit(rho)) return complex<Type>(Type(NAN), Type(NAN));
    if (isnan(theta))
    {
        if (isinf(rho)) return complex<Type>(rho, theta);
        return complex<Type>(theta, theta);
    }
    if (isinf(theta))
    {
        if (isinf(rho)) return complex<Type>(rho, Type(NAN));
        return complex<Type>(Type(NAN), Type(NAN));
    }
    Type x = rho * cos(theta);
    if (isnan(x)) x = 0;
    Type y = rho * sin(theta);
    if (isnan(y))
        y = 0;
    return complex<Type>(x, y);
}

and essentially performs a polar-to-cartesian conversion, while arg is defined as

template<class Type>
inline __host__ __device__ Type arg(const complex<Type>& c) { return atan2(c.imag(), c.real()); }

It seems to me that this approach does not correctly manage the branch cuts, since arg suffers from the problem described in the original post.

I’m now testing a different approach, namely

float2_ sqrt(float2_ z)
{
   float x = fabs(z.c.x); float y = fabs(z.c.y);
   if (x > y)
   {
      float r = y / x; float s = sqrt(1.0f + r * r) + 1.0f;
      x = sqrt(x * s / 2.f); y = sqrt(y * r / (s * 2.f));
   }
   else if (y == 0.f) return float2_(0.f, 0.f);
   else
   {
      float r = x / y; float s = sqrt(1.0f + r * r) + r;
      x = sqrt(y * s / 2.f); y = sqrt(y / (s * 2.f));
   }
   if (z.c.y >= 0) { if (z.c.y >= 0.f) return float2_(x, y); else return float2_(y, x); }
   else { if (z.c.x >= 0.f) return float2_(x, -y); else return float2_(y, -x); }
}

where float2_ is my own custom class wrapping float2. It seems to solve at least the problem from the original post.

I am in a hurry and may be wrong, but in the computation above it looks like you would want to use hypot() instead of doing your own scaling. I have not looked into complex arithmetic beyond the basics, so I am afraid I can’t offer any guidance on dealing with branch cuts in the most efficient (and correct) manner.

Thanks for the suggestion to use hypot(). By the way, I realized that the last conditional statements of the code above were wrong. This is the final, fixed version of sqrt, including your recommendation:

__host__ __device__ __forceinline__ float2_ sqrt(float2_& z)
{
	float rr = z.c.x;
	float ii = z.c.y;
	
	float x = fabs(rr); float y = fabs(ii);

	if ((x == 0.f) && (y == 0.f)) return float2_(0.f, 0.f);

	float temp = hypot(x,y)+x;
	x = sqrt(0.5f * temp);
	y = sqrt(0.5f * y * y / temp);
	if (ii >= 0.f) { if (rr >= 0.f) return float2_(x, y); else return float2_(y, x); }
	else { if (rr >= 0.f) return float2_(x, -y); else return float2_(y, -x); }
}

By the way, we have shared a library of device functions and operators on complex numbers at http://www.orangeowlsolutions.com/bluebird, including the following functions: real, imag, abs, angle, conj, log, log10, polar, sqrt, sin, cos, tan, exp, pow, asinh, acosh, atanh, sinh, cosh, tanh, asin, atan, acos. The downloadable archive contains a Stand-Alone Complex directory as well as a Visual Studio usage example. We hope it will be useful to the community.

Again, I haven’t looked too closely, but it seems to me that the computation for x, y could be re-arranged as follows:

float t = hypot(x,y) + x;
float s = rsqrt (0.5f * t);
x = 0.5f * s * t;  // 0.5*t*rsqrt(0.5*t) = 0.5/sqrt(0.5)*t/sqrt(t) = sqrt(0.5)*sqrt(t) = sqrt(0.5*t)
y = 0.5f * s * y;  // 0.5*sqrt(y*y)/sqrt(0.5*t) = sqrt(0.5)*sqrt(y*y)/sqrt(t) = sqrt(0.5*y*y/t)

You might have to handle cases where x or y are zero or infinity separately, but it should simplify the computation for the vast majority of cases.
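
Sketched out, the rearrangement slotted into the earlier sqrt() might look like the following (rsqrtf() is a device-only intrinsic, and float2_ is the wrapper class from the posts above, so this is only an illustration):

__device__ float2_ sqrt_rearranged(const float2_& z)
{
    float rr = z.c.x, ii = z.c.y;
    float x = fabsf(rr), y = fabsf(ii);
    if ((x == 0.f) && (y == 0.f)) return float2_(0.f, 0.f);
    float t = hypotf(x, y) + x;
    float s = rsqrtf(0.5f * t);   // one reciprocal square root replaces the two sqrt()/divisions
    x = 0.5f * s * t;             // = sqrt(0.5*t)
    y = 0.5f * s * y;             // = sqrt(0.5*y*y/t)
    if (ii >= 0.f) { if (rr >= 0.f) return float2_(x, y); else return float2_(y, x); }
    else           { if (rr >= 0.f) return float2_(x, -y); else return float2_(y, -x); }
}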

Looks really nice! I did an internship this past summer where they did something similar: they defined a special matrix type that could be passed to MATLAB and back from C++ via MATLAB’s Automation Server, which made prototyping MATLAB code and porting it to C++ very easy. In that case it was mostly MATLAB image-processing functions ported to C, coupled with templated overloading much like Bluebird does, whereas Bluebird is more complex-number oriented and a wrapper around CUDA.