Compilation broken sign-change-detection code

I wanted to make a simple code for detecting the sign change for my CUDA kernel. To avoid flow branches I did this only with comparisons and boolean logic operations. Here is my code:

__host__ __device__

  bool SignChange( const float & a, const float & b )

  {

    /*

0      a < 0

1      a > 0

2      0 ^ 1

3      b < 0

4      b > 0

5      3 ^ 4

(0 ^ 3)

  || (1 ^ 4)

  || (2 ^ 5)

*/

bool d0 = a < 0;

    bool d1 = a > 0;

    bool d2 = d0 ^ d1;

    bool d3 = b < 0;

    bool d4 = b > 0;

    bool d5 = d3 ^ d4;

return (d0 ^ d3) || (d1 ^ d4) || (d2 ^ d5);

  }

If the code isn’t used, it is ignored and the whole program compiles fine. If, however, I try to call that function, the following happens:

### Assertion failure at line 106 of ../../be/cg/NVISA/expand.cxx:

### Compiler Error in file keep/kernel.cpp3.i during Code_Expansion phase:

### unexpected mtype

nvopencc INTERNAL ERROR: /home/tener/localopt/cuda/open64/lib//be returned non-zero status 1

nvcc is:

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2010 NVIDIA Corporation

Built on Wed_Nov__3_16:16:57_PDT_2010

Cuda compilation tools, release 3.2, V0.2.1221

The actual component that failed is nvopencc:

/home/tener/localopt/cuda/open64/bin/nvopencc -TARG:compute_20 -m64 -CG:ftz=1 -CG:prec_div=1 -CG:prec_sqrt=1  "keep/kernel" "keep/kernel.cpp3.i"  -o "keep/kernel.ptx"

### Assertion failure at line 106 of ../../be/cg/NVISA/expand.cxx:

### Compiler Error in file keep/kernel.cpp3.i during Code_Expansion phase:

### unexpected mtype

nvopencc INTERNAL ERROR: /home/tener/localopt/cuda/open64/lib//be returned non-zero status 1

Version:

NVIDIA (R) CUDA Open64 Compiler

Cuda compilation tools, release 3.2, V0.2.1221

Built on 2010-11-03

Portions Copyright (c) 2005-2010 NVIDIA Corporation

Portions Copyright (c) 2002-2005 PathScale, Inc.

Portions Copyright (c) 2000-2001 Silicon Graphics, Inc.

All Rights Reserved.

I attach relevant file kernel.cpp3.i. For the faulty code see the line 6026. The function in question is inlined there.

__cuda_local_var_149505_10_non_const_sign_has_changed = (((((((__T261 = ((char)((*((const float *)(&__cuda_local_var_149504_11_non_const_val))) < (0.0F)))) , ((void)(__T262 = ((char)((*((const float *)(&__cuda_local_var_149504_11_non_const_val))) > (0.0F)))))) , ((void)(__T263 = ((char)((((int)__T261) ^ ((int)__T262)) != 0))))) , ((void)(__T264 = ((char)((*((const float *)(&__cuda_local_var_149512_9_non_const_tmp))) < (0.0F)))))) , ((void)(__T265 = ((char)((*((const float *)(&__cuda_local_var_149512_9_non_const_tmp))) > (0.0F)))))) , ((void)(__T266 = ((char)((((int)__T264) ^ ((int)__T265)) != 0))))) , ((char)(((((int)__T261) ^ ((int)__T264)) || (((int)__T262) ^ ((int)__T265))) || (((int)__T263) ^ ((int)__T266)))));

Please tell me if you need more information.
kernel.cpp3.i.txt (263 KB)

Thanks for bringing this to our attention. I was able to repro this with the CUDA 3.2 toolchain, on WinXP64. I will followup with the compiler team.

__host__ __device__  bool SignChange( const float & a, const float & b )  

{    

    /*0  a < 0

      1  a > 0

      2  0 ^ 1

      3  b < 0

      4  b > 0

      5  3 ^ 4     

         (0 ^ 3)  

      || (1 ^ 4)  

      || (2 ^ 5)      

    */        

    bool d0 = a < 0;    

    bool d1 = a > 0;    

    bool d2 = d0 ^ d1;    

    bool d3 = b < 0;    

    bool d4 = b > 0;    

    bool d5 = d3 ^ d4;        

    return (d0 ^ d3) || (d1 ^ d4) || (d2 ^ d5);  

}

__global__ void kernel (float a, float b, bool *i)

{

    *i = SignChange(a, b);

}

C:[…]\r3.2[…]>nvcc -arch=sm_20 -o test test.cu

test.cu

tmpxft_00000a34_00000000-3_test.cudafe1.gpu

tmpxft_00000a34_00000000-8_test.cudafe2.gpu

test.cu

Assertion failure at line 106 of …/…/be/cg/NVISA/expand.cxx:

Compiler Error in file C:/DOCUME~1/njuffa/LOCALS~1/Temp/tmpxft_00000a34_00000000-9_test.cpp3.i during Code_Expansion phase:

unexpected mtype

nvopencc ERROR: c:[…]\r3.2\bin\x86_64_win32_release/open64/lib//be.exe returned non-zero status 1

Yep, compiler bug. If you can crash the compiler (no matter WHAT your code is!), it’s a compiler bug.

As an aside, though, you probably have the wrong mindset about “trying to reduce flow branching.”

For tests like this, there’s no problem really. You need to worry about cases where you do THOUSANDS of instructions with just some threads. Doing tests which may diverge for a few instructions is rarely an efficiency concern.

And an aside aside, can’t you just replace your code with:

__host__ __device__  bool SignChange( const float & a, const float & b )

{

    return a*b>0.0f || (0.0f==a && 0.0f==b);

}

I’m quite amazed that it popped up in such simple code. Yet, maybe I’m just an evil coder ;-)

Well, the documentation has it’s own share of tricks on how to make all threads in (half)warp go the same path, so I’m not sure about this. The only way to be sure is to actually measure the performance, which I was going to do. I’ll try your version and see if it improves anything - I also have 2 other versions, so why not try another one :-)

My particular version has two nice properties that I miss:

  • the compiler is eager to inline it, so it seem it’s simple enough for him. In fact it boils down to single expression, as can be seen on the ptx code I quoted.

  • you can replace 0.0f with some very small epsilon to test for a range around 0.0. In particular I’m worried that 0.0f != -0.0f.

wrong topic, sorry.