Are there plans to implement -ffinite-math-only -fno-signed-zeros?

Or -ffast-math in general, for that matter…

See Compiler Explorer

Using clang, no code is generated for the arithmetic bits, since for normal floating-point values the operations are a no-op (adding and subtracting the same amount).

NVCC cannot optimize that. In simulation codes where a lot of constants are used (think stencil codes) this implies unnecessary operations. In order to work around this, we need to employ rather ugly code generation to explicitly list the operations for each direction instead of using a simple table of constants like we can do on CPU.

Also notice that the integer version is not fully optimized. NVCC does succeed in removing the multiplication by 0, but not in realizing that the addition and subtraction cancel out. Earlier versions of clang (16 and earlier) do not optimize this either.

The integer version will be fully optimized if the input and output pointers are marked as __restrict__, telling the compiler there is no pointer aliasing. The float version is unaffected.


Are you referring to host-code clang? Because the godbolt link you provided shows arithmetic being generated from cuclang.

I guess I must be dense.

Anyway, the usual suggestion for making a feature request is to file a bug. Future plans are rarely discussed on these forums.

I understand the code posted on Compiler Explorer is just an example, but off-hand I don’t see how -fno-signed-zeros would help with code like this. In the context of addition, signed zeros usually don’t propagate far, and by initializing with +0 that seems certain here?

The utility of -ffinite-math-only I can see, as without it the additions cannot be optimized out because A[i] could be infinity, in which case -INF + INF must deliver a NaN, so the addition cannot simply be removed.

Historically, when it comes to floating-point computation, the general philosophy of the CUDA compiler has been to treat such code conservatively, with the exception of FMA merging, which is turned on by default because it is just too important to fully utilize the computational throughput of the GPU, but it can be turned off with a compiler switch. The existing -use_fast_math switch activates a very specific set of “looseness” features in handling floating-point computations that was designed for staying performance-competitive with code written using NVIDIA’s Cg shading language in the early days of CUDA. You can tell it’s one of the earliest switches because it doesn’t follow the regular convention (using underscore instead of dash).

The conservative treatment of floating-point computation was motivated by the desire to preserve programmers’ sanity by not destroying carefully crafted numerical algorithms. This is something various host compilers of the time did at the drop of a hat, with poorly designed or unavailable compiler switches to control it. Minimizing programmer frustration is paramount when trying to establish a new computing platform. To be perfectly honest, I have been writing high-performance floating-point code for almost three decades, and I usually go with the equivalent of what clang folks know as -ffp-model=strict or -ffp-model=precise to maintain my sanity.

The compiler world has progressed since these design decisions were made. Supporting deviations from IEEE-754 compliant floating-point handling through fairly fine-grained opt-in features with which programmers are familiar from other toolchains seems non-problematic in principle. If you decide to file an enhancement request with NVIDIA (which I would encourage), make sure to include motivation, e.g. how much performance gain at application level would be expected, which application areas could benefit, etc. NVIDIA has historically been very open to customer requests; that is how CUDA and its ecosystem grew. The more customers ask for the same feature, the wider its general applicability, and the more pronounced its impact, the higher the chances that the feature will get implemented.


Good point, this is useful, thanks!

I meant that using cuclang, for the float kernel:

__global__ void floatKernel(const float *A, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements) {
        C[i] = 0.f;
        C[i] += 1 * A[i];
        C[i] += -1 * A[i];
        C[i] += 0 * A[i];
    }
}

I get this PTX:

        ld.param.u64    %rd2, [floatKernel(float const*, float*, int)_param_1];
        cvta.to.global.u64      %rd3, %rd2;
        ld.param.u32    %r1, [floatKernel(float const*, float*, int)_param_2];
        mov.u32         %r2, %ntid.x;
        mov.u32         %r3, %ctaid.x;
        mov.u32         %r4, %tid.x;
        mad.lo.s32      %r5, %r2, %r3, %r4;
        setp.ge.s32     %p1, %r5, %r1;
        @%p1 bra        $L__BB0_2;
        mul.wide.s32    %rd4, %r5, 4;
        add.s64         %rd1, %rd3, %rd4;
        mov.b32         %r6, 0;
        st.global.u32   [%rd1], %r6;
$L__BB0_2:
        ret;

For the simplified kernel:

__global__ void floatKernel(const float *A, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements) {
        C[i] = 0.f;
    }
}

I get the same PTX:

        ld.param.u64    %rd2, [floatKernel(float const*, float*, int)_param_1];
        cvta.to.global.u64      %rd3, %rd2;
        ld.param.u32    %r1, [floatKernel(float const*, float*, int)_param_2];
        mov.u32         %r2, %ntid.x;
        mov.u32         %r3, %ctaid.x;
        mov.u32         %r4, %tid.x;
        mad.lo.s32      %r5, %r2, %r3, %r4;
        setp.ge.s32     %p1, %r5, %r1;
        @%p1 bra        $L__BB0_2;
        mul.wide.s32    %rd4, %r5, 4;
        add.s64         %rd1, %rd3, %rd4;
        mov.b32         %r6, 0;
        st.global.u32   [%rd1], %r6;
$L__BB0_2:
        ret;

I thought the same, but when I remove -fno-signed-zeros from the options, code is generated for the arithmetic bits, i.e. removing -fno-signed-zeros results in more code. Actually, ranked by line count: -ffinite-math-only > no flags > -fno-signed-zeros > -fno-signed-zeros -ffinite-math-only. I don’t get the compiler logic, though.

Thanks, that is encouraging! I’ll look into it. I can put in a bit of work and provide actual performance numbers (naive implementation with all the multiplications by 0, 1, -1, vs. CAS-generated simplified expressions). Off the top of my head it is about 20-30%, so pretty significant.

Is cuclang something that is provided and/or maintained by NVIDIA?

My advice to file an enhancement request with NVIDIA was based on the assumption that you are using nvcc with its LLVM-derived NVVM component for generating PTX, which is then in turn compiled into SASS (machine code) by NVIDIA’s proprietary optimizing compiler ptxas.

Since PTX code represents both a virtual architecture and a compiler intermediate representation, meaningful performance assessments in a CUDA context must always consider SASS rather than PTX. In this case the result is unlikely to look much different, as ptxas, invoked as part of the CUDA toolchain, follows the same conservative approach with regard to floating-point computation that I outlined earlier.

It can be non-trivial to determine which code transformations are “legal” with (1) full IEEE-754 compliance or (2) specific parts of IEEE-754 compliance removed. It takes a lot of careful reasoning, taking into account exactly which properties are guaranteed under various switch settings and which corner cases arise from that. This is also an area where compilers still have many flaws. The Intel “classic” compiler was the most reliable one in this regard, after many years of trying to improve floating-point consistency. But it was abandoned three years ago in favor of a clang-based solution, and it seems Intel engineers are now busy teaching that compiler proper floating-point handling.

I recently was looking at some fairly simple code compiled by the latest clang which contained two calls to sqrtf(). With somewhat relaxed floating-point settings, the first one got translated into a naked square root instruction, while the second one resulted in a call to the system math library. I could not figure out what caused this discrepancy. My goal was to have both sqrtf() instances map directly to a square root instruction. I have yet to resolve that issue.

Agreed, if I provide numbers, they must be runtime numbers.

I just compiled the float example PTX to SASS for both clang and nvcc, and they are identical without -ffast-math. The removal of the arithmetic part with -ffast-math can be seen in the SASS, as you correctly suspected.

When I click on the godbolt link you provided, I get this PTX:

.visible .entry floatKernel(float const*, float*, int)(
        .param .u64 floatKernel(float const*, float*, int)_param_0,
        .param .u64 floatKernel(float const*, float*, int)_param_1,
        .param .u32 floatKernel(float const*, float*, int)_param_2
)
{

        ld.param.u64    %rd1, [floatKernel(float const*, float*, int)_param_0];
        ld.param.u64    %rd2, [floatKernel(float const*, float*, int)_param_1];
        ld.param.u32    %r2, [floatKernel(float const*, float*, int)_param_2];
        mov.u32         %r3, %ntid.x;
        mov.u32         %r4, %ctaid.x;
        mov.u32         %r5, %tid.x;
        mad.lo.s32      %r1, %r3, %r4, %r5;
        setp.ge.s32     %p1, %r1, %r2;
        @%p1 bra        $L__BB0_2;

        cvta.to.global.u64      %rd3, %rd2;
        mul.wide.s32    %rd4, %r1, 4;
        add.s64         %rd5, %rd3, %rd4;
        mov.u32         %r6, 0;
        st.global.u32   [%rd5], %r6;
        cvta.to.global.u64      %rd6, %rd1;
        add.s64         %rd7, %rd6, %rd4;
        ld.global.f32   %f1, [%rd7];
        add.ftz.f32     %f2, %f1, 0f00000000;
        st.global.f32   [%rd5], %f2;
        ld.global.f32   %f3, [%rd7];
        sub.ftz.f32     %f4, %f2, %f3;
        st.global.f32   [%rd5], %f4;
        ld.global.f32   %f5, [%rd7];
        fma.rn.ftz.f32  %f6, %f5, 0f00000000, %f4;
        st.global.f32   [%rd5], %f6;

$L__BB0_2:
        ret;

Looks different. Strange.

Maybe you are looking at the right-hand column? That’s the NVCC output; the middle column is the clang-cu output.
