error when trying to use half (fp16)

LukeCuda · October 9, 2015, 2:11pm

I want to try the half type, but I am hitting a problem when converting float to half. I am doing a simple test as follows:

float x = 123;
half h = __float2half(x);

I can use ‘half’ OK, but the call to __float2half(x) fails to compile with:

Error	4	error C3861: '__float2half': identifier not found

I have these includes:

#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
using namespace std;

What am I doing wrong? Thanks!
Note: the conversion function is mentioned here: http://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/

Robert_Crovella · October 9, 2015, 2:14pm

I assume that “simple test” represents host code rather than device code. (It’s better if you provide complete code, rather than snippets.)

The intrinsic that you are trying to use is not available in host code.

This is mentioned in the documentation for the half intrinsics:
CUDA Math API :: CUDA Toolkit Documentation

You may want to read the answer here:

Can anyone provide sample code demonstrating the use of 16 bit floating point in cuda? - Stack Overflow

LukeCuda · October 10, 2015, 12:17am

I would rather not put my floats on the device and then convert them, because the whole point of half is to save memory. If the floats aren’t going to fit, then how do I convert them on the device? Obviously it will have to be done piece by piece, and that is a big pain in the ass. So I guess device code is just not the right way to go about it. What were the NVIDIA engineers thinking?

I am reading now that arithmetic and the math libraries don’t appear to support half either. I think it’s a little early for NVIDIA to announce these things if they are unusable in their current form. It gets one excited only to be let down. Seems to be a very common trend with CUDA over the years. Oh well, back to dreaming about ‘half’ing my GPU hardware costs. Wake me up when NVIDIA releases full support.

CudaaduC · October 10, 2015, 12:31am

I worked with this new type which is detailed here;

https://devtalk.nvidia.com/default/topic/880571/test-of-new-16-bit-float-half-type-in-cuda-7-5/

Rather than moaning about having to convert on the device (which can be faster than using the host), just break the data up into a few chunks and write your own kernel to do the type conversion.

allocate half device memory
allocate temporary float device memory of sub-set size
copy first chunk of host to device (float)
run kernel which does element-wise conversion to half and put into respective place in half buffer
copy second chunk from host to device
run kernel on that subset
etc…

when done free the device float buffer and you are done

Is your buffer of floats so large that it cannot fit in device memory? The Titan X has 12GB, the GTX 980 Ti 6GB and the GTX 980 4GB.

CudaaduC · October 10, 2015, 12:41am

Also you can do basic math operations on the half type;

http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF__ARITHMETIC.html#group__CUDA__MATH____HALF__ARITHMETIC

but the better idea is to cast back up to float, do the computation in 32 bit, then cast that result back down to 16 bit half for storage.

njuffa · October 10, 2015, 3:31am

Like other such CUDA intrinsics starting with a double underscore, __float2half() is a device function that cannot be used in host code.

Since host-side conversion from float (fp32) to half (fp16) is desired, it would make sense to check the host compiler documentation for support. I am reasonably certain that current ARM tool chains support this but do not know chapter and verse. For Intel platforms, there is this intrinsic:

Intel® Intrinsics Guide

Supported on Ivy Bridge and Haswell according to the documentation linked above. If past history is any indication, the Intel-defined intrinsics usually show up in identical fashion across the tool chains for Windows, Linux, and Mac.

[Later:] Apparently on ARM platforms, half precision is implemented as the ‘__fp16’ type. See this question and answer: gcc - __fp16 type undefined in GNU ARM C++ - Stack Overflow. I don’t know whether it is supported on NVIDIA’s ARM platforms, but it definitely seems worth a try.

Robert_Crovella · October 10, 2015, 2:41pm

I’m sorry you’re unhappy with the current feature set.

The implementation is admittedly limited right now. If you read the stackoverflow article I linked, it pointed out some of those limitations and also gave a rationale for why it might be useful now in spite of those limitations, and also gave some hints as to how it might be improved later.

Regarding “unusable in their current form”, I think that just because you don’t understand how to take advantage of it now, or that it doesn’t seem to help for your particular use case, does not mean that there are no possible uses for the current functionality, or that it makes no sense to release that functionality now.

As njuffa pointed out, keep in mind that the “nvidia engineers” are not supplying the host compiler, and have little control over it. Therefore, effective usage of the half datatype for some use cases may depend on host compiler support, and/or your ability to figure out how to do what you want in the host environment.

There are definitely some use cases now, as hinted at in the stackoverflow article, where this functionality is very interesting to a set of non “nvidia engineers”.

Having said all that, from my perspective, and maybe from yours, one glaring deficiency is not making basic half-to-float and float-to-half conversion intrinsics usable in host code. A couple suggestions:

File an RFE (bug report) with NVIDIA. Customer feedback does definitely drive development paths moving forward, although I can’t make specific promises in specific cases.
Write your own? I don’t think it would be that hard. The nvidia engineers have no magical control over the host compiler or the host processor. Any host-based implementation of half/float conversion routines would ultimately have to be something that complies with language specifications, and host compiler capability, and host processor capability. They would be subject to the same constraints that you would be. If you did write those conversion routines, and made them available to the community, others might find that quite useful as well.

Robert_Crovella · October 10, 2015, 2:46pm

Those basic math operations are supported via intrinsics, and those particular intrinsics are only defined for device architectures that natively support such operations. Today that is (AFAIK) exactly one device: Tegra X1. In the future, presumably other devices will provide native support (Jensen essentially said as much at GTC 2015). So for most devices in existence today, conversion to float is the only option (again, AFAIK) for math operations, even on the device.

CudaaduC · October 10, 2015, 5:21pm

Noted. I had not actually used those intrinsics, as I used the conversion functions back to float and did the computation in 32 bit.

I found this type very useful, and for our back projection algorithms use of this type has resulted in about a 70% increase in performance with only a very small loss in accuracy.

In my mind the main benefit of using this type is the ability (in some circumstances) to effectively double the amount of shared memory space per thread block. By that I mean I can now store twice as many 16 bit float values as I could 32 bit float values. This enables larger lookup tables and scratch pads for intermediate calculations.

njuffa · October 10, 2015, 5:33pm

For those who would like to use portable code for float-to-half conversions, rather than the platform-specific means that I pointed out earlier, you may want to consider the code below. I only implemented the round-to-nearest-or-even variant. I tested this exhaustively against the device function __float2half_rn() on an sm_50 GPU, which maps to the F2F.F16.F32 instruction.

/*
  Copyright (c) 2015, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

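#include <stdint.h>  /* uint16_t, uint32_t */
#include <string.h>  /* memcpy */

/* reinterpret the bits of a 16-bit integer as an fp16 value */
__fp16 uint16_as_fp16 (uint16_t a)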
{
    __fp16 res;
#if defined (__cplusplus)
    memcpy (&res, &a, sizeof (res));
#else /* __cplusplus */
    volatile union {
        __fp16 f;
        uint16_t i;
    } cvt;
    cvt.i = a;
    res = cvt.f;
#endif /* __cplusplus */
    return res;
}

uint32_t fp32_as_uint32 (float a)
{
    uint32_t res;
#if defined (__cplusplus)
    memcpy (&res, &a, sizeof (res));
#else /* __cplusplus */
    volatile union {
        float f;
        uint32_t i;
    } cvt;
    cvt.f = a;
    res = cvt.i;
#endif /* __cplusplus */
    return res;
}

/* host version of device function __float2half_rn() */
__fp16 float2half_rn (float a)
{
    uint32_t ia = fp32_as_uint32 (a);
    uint16_t ir;

    ir = (ia >> 16) & 0x8000;
    if ((ia & 0x7f800000) == 0x7f800000) {
        if ((ia & 0x7fffffff) == 0x7f800000) {
            ir |= 0x7c00; /* infinity */
        } else {
            ir = 0x7fff; /* canonical NaN */
        }
    } else if ((ia & 0x7f800000) >= 0x33000000) {
        int shift = (int)((ia >> 23) & 0xff) - 127;
        if (shift > 15) {
            ir |= 0x7c00; /* infinity */
        } else {
            ia = (ia & 0x007fffff) | 0x00800000; /* extract mantissa */
            if (shift < -14) { /* denormal */  
                ir |= ia >> (-1 - shift);
                ia = ia << (32 - (-1 - shift));
            } else { /* normal */
                ir |= ia >> (24 - 11);
                ia = ia << (32 - (24 - 11));
                ir = ir + ((14 + shift) << 10);
            }
            /* IEEE-754 round to nearest or even */
            if ((ia > 0x80000000) || ((ia == 0x80000000) && (ir & 1))) {
                ir++;
            }
        }
    }
    return uint16_as_fp16 (ir);
}
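For the opposite direction, a minimal host-side sketch of half-to-float widening is shown below. It is not part of the code posted above. It works on the raw 16-bit pattern (uint16_t) rather than on __fp16, so it does not depend on host compiler support for a half type. The names uint32_as_fp32 and half_bits_to_float are hypothetical. Every binary16 value is exactly representable in binary32, so no rounding is involved.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (same memcpy idiom as above). */
float uint32_as_fp32(uint32_t a)
{
    float res;
    memcpy(&res, &a, sizeof res);
    return res;
}

/* Hypothetical helper: widen an IEEE-754 binary16 bit pattern to float. */
float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1fu;   /* 5-bit biased exponent */
    uint32_t man  = h & 0x03ffu;         /* 10-bit stored mantissa */
    uint32_t bits;

    if (exp == 0x1fu) {                  /* infinity or NaN */
        bits = sign | 0x7f800000u | (man << 13);
    } else if (exp != 0) {               /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {               /* signed zero */
        bits = sign;
    } else {                             /* subnormal: normalize first */
        uint32_t e = 113;                /* biased float exponent of 2^-14 */
        while ((man & 0x0400u) == 0) {
            man <<= 1;
            e--;
        }
        bits = sign | (e << 23) | ((man & 0x03ffu) << 13);
    }
    return uint32_as_fp32(bits);
}
```

As a sanity check against the value from the first post, the binary16 pattern 0x57b0 widens to 123.0f.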

Robert_Crovella · October 10, 2015, 6:10pm

nice work njuffa. (From my perspective it looks harder than I thought.)

Seems like a great example and clever usage that depends only on the device-level functionality already exposed.

LukeCuda · October 11, 2015, 4:36pm

Thanks for the replies. I think I can use half effectively now for what I need. It is SgemmEx (cublasSgemmEx) that really shines in the current implementation. For all the other operations, just convert to float on the fly, like below:

// c = a * b
 half c = __float2half(__half2float(a)*__half2float(b));

HannesF99 · October 12, 2015, 8:05am

We use the ‘half_float’ library on the host.
http://half.sourceforge.net/
It conforms to the IEEE 754 standard.
It is slow (even compared to CPU float), but it serves our purpose of having a ‘gold’ CPU implementation against which we compare the GPU implementation.

njuffa · October 12, 2015, 4:47pm

I cannot speak to the quality of this library, but I wonder whether using it as a “golden” reference is justified. In particular, I would be concerned about issues with double rounding, based on the following description on the website you pointed to:

“arithmetic operations are internally rounded to single-precision using the underlying single-precision implementation’s current rounding mode, those values are then converted to half-precision using the default half-precision rounding mode”.

I do not recall the exact details of when double rounding is without problems, as it differs by operation, but seem to recall that it requires > 2*p+2 bits for the wider format for some of them, which would mean > 24 for p=11.

[Later:]
S. A. Figueroa: “When is double rounding innocuous?”. SIGNUM Newsletter 30(3), 21-26 (1995) showed that double rounding is innocuous if q >= 2p for multiplication and division, q >= 2p+1 for addition, and q >= 2p+2 for square root. Using binary32 for binary16 computation we have q=24 and p=11, so these operations would be safe.

However, this may not apply to other operations, such as FMA (fused multiply-add), rsqrt, or integer-to-float conversions. For example, in an analogous case of converting 64-bit integers to binary32, it has been shown that f32_s64(n) != f32_f64(f64_s64(n)) and similarly for u64. [Sylvie Boldo, Jacques-Henri Jourdan, Xavier Leroy, and Guillaume Melquiond: “Verified Compilation of Floating-Point Computations”. Journal of Automated Reasoning 54 (2), 135-163 (2015)].
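The integer-conversion case can be reproduced on the host with a small sketch. The function names and the witness value 2^60 + 2^36 + 1 are my own choices for illustration, not taken from the cited paper, and the sketch assumes IEEE-754 binary32/binary64 with round-to-nearest-even conversions. Converted directly, the value lies just above the midpoint between two adjacent binary32 values and rounds up. Converted via binary64, the trailing 1 is lost first, leaving an exact tie that then rounds to even (down).

```c
#include <stdint.h>

/* Direct conversion: a single rounding from 64-bit integer to binary32. */
float f32_from_s64_direct(int64_t n)
{
    return (float)n;
}

/* Two-step conversion: round to binary64 first, then to binary32.
   The volatile forces a real binary64 value, guarding against hosts
   that would otherwise keep excess precision in a register. */
float f32_from_s64_via_f64(int64_t n)
{
    volatile double d = (double)n;
    return (float)d;
}

/* Hypothetical witness: 2^60 + 2^36 + 1.
   Direct conversion gives 2^60 + 2^37, via binary64 gives 2^60. */
int64_t double_rounding_witness(void)
{
    return ((int64_t)1 << 60) + ((int64_t)1 << 36) + 1;
}
```

The two results differ by one binary32 ulp, which is exactly the kind of mismatch to keep in mind when comparing a host reference against device results.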

[Even later:]
Cristina Iordache and David W. Matula: “On Infinitely Precise Rounding for Division, Square Root, Reciprocal and Square Root Reciprocal”. In Proceedings of the 14th IEEE Symposium on Computer Arithmetic (ARITH-14), pp. 233–240, states that rsqrt (reciprocal square root) requires q >= 2p+3 to avoid double rounding issues.
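Plugging the binary16-via-binary32 case into the bounds quoted above gives the following quick check. This is just the arithmetic from the cited conditions, with p = 11 (binary16 precision) and q = 24 (binary32 precision), and it says nothing about operations not listed.

```latex
\begin{align*}
p &= 11, \qquad q = 24 \\
\text{mul, div:} \quad 2p     &= 22 \le 24 \quad \text{(condition holds)} \\
\text{add:}      \quad 2p + 1 &= 23 \le 24 \quad \text{(condition holds)} \\
\text{sqrt:}     \quad 2p + 2 &= 24 \le 24 \quad \text{(condition holds, with equality)} \\
\text{rsqrt:}    \quad 2p + 3 &= 25 > 24   \quad \text{(condition fails)}
\end{align*}
```

So under these bounds, rsqrt computed in binary32 and then rounded to binary16 is not guaranteed to be free of double rounding, while the other four operations are.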

HannesF99 · October 13, 2015, 4:30pm

You have a much deeper understanding than I do of the subtleties of floating-point representation and operations on floating-point numbers.
From the practical side, I can only say that we use it as the ‘CPU counterpart’, and it serves our needs quite well. We always write the CPU routine first and then derive the GPU kernel from it for our image processing routines, and when we compare the two we get very similar results.
On the CPU, we use the ‘half_float’ class for storage and also its overloaded mathematical operations (I am not sure whether they convert to float32 internally and do the calculation in float32 precision), whereas on the GPU we use the half type only for storage and convert ‘half’ values to float32 before applying arithmetic operations to them.

njuffa · October 13, 2015, 4:54pm

As long as you are aware of possible limitations for both CPU-side and GPU-side computations, that seems perfectly fine.

My pet peeve is the approach employed by some SW developers that declares any computation on the CPU as “golden”, no matter how achieved, and treats the equivalent computation on the GPU as “in error” if the results do not match bit-wise or are at least extremely close.

Fact is, instructions on CPUs can be buggy, host compilers may take numerical shortcuts to improve performance (even at default settings), some host platforms can have issues with double rounding, older CPUs lack support for FMA, etc., etc. So one would want to be very cautious in general with designating the results from CPU computation as “golden”.