16-bit float operations

The only ‘support’ for the 16-bit float type that I am aware of in the standard CUDA SDK is the CUDA Math API’s ‘Type Casting Intrinsics’ section:

[url]http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__CAST.html#group__CUDA__MATH__INTRINSIC__CAST[/url]

There are a few functions that convert unsigned short (FP16 storage) values to 32-bit float and back.

I would like to be able to perform 16-bit floating-point multiplication, addition, and subtraction (or an FMA, if possible), and I am not sure whether there is already some existing CUDA functionality for this.

I think texture objects have a built-in interpolation ability, but I have not found any examples. Can anyone point me to some examples or documentation on this topic?

I already searched, and this is the best thing I have found so far:

GPU Programming and Streaming Multiprocessors | 8.1. Memory | InformIT

but I wonder if anyone can point me to a code example of half-precision operations in CUDA.

Some kind of FP16 support in Pascal was hinted at by NVIDIA CEO Jen-Hsun Huang during the GTC 2015 keynote. At the moment I don’t think you’ll find much exposed in CUDA that reflects FP16 support. Presumably that will appear in CUDA in time for Pascal support.

I don’t think you’ll find hardware-level (i.e., SASS) support for any of the FP16 math operations you listed in any compute capability up to 5.2.

The Tegra X1 whitepaper contains considerable discussion of FP16 support, including FMA:

[url]http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf[/url]

AFAIK CUDA has not exposed significant support for this capability yet.

I think you found all the FP16 support there currently is. Just like on other platforms (notably ARM) half precision is currently available only as a storage format, but not as a computational format. So the advantage of half precision compared to single precision is in increased storage density and bandwidth reduction. The computation itself needs to happen with float operands. Reading FP16 data from textures automatically expands the data to FP32, making this path particularly efficient. For normal loads, CUDA provides intrinsics for conversion between FP16 and FP32, as you already noted.
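
To make that concrete, here is a minimal sketch of the storage-format pattern (the kernel and its names are just illustrative; it assumes the FP16 data is packed as unsigned short in device memory): load, expand with __half2float, do the math in float, and round back with __float2half_rn.

[code]
// Sketch: FP16 as a storage format only. All arithmetic happens in FP32.
// Assumes a, b, c hold IEEE-754 half values packed into unsigned short.
__global__ void fma_fp16_storage(const unsigned short *a,
                                 const unsigned short *b,
                                 const unsigned short *c,
                                 unsigned short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float fa = __half2float(a[i]);   // expand FP16 -> FP32
        float fb = __half2float(b[i]);
        float fc = __half2float(c[i]);
        float r  = fmaf(fa, fb, fc);     // single-precision FMA
        out[i]   = __float2half_rn(r);   // round back to FP16 for storage
    }
}
[/code]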

A worked example using FP16 textures can be found at [url]https://devtalk.nvidia.com/default/topic/547080/-half-datatype-ieee-754-conformance[/url]. I think you should be able to extend that example code to include interpolation, but I have not tried that myself.
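
For the interpolation part, an untested sketch along these lines should work (the helper and kernel names are made up for illustration): create a CUDA array with a 16-bit float channel descriptor, create a texture object with cudaFilterModeLinear, and fetch with tex2D<float>(); the texture unit expands the FP16 texels to FP32 and does the filtering in hardware.

[code]
#include <cuda_runtime.h>

// Sketch: bilinear interpolation of FP16 data through a texture object.
__global__ void sample_kernel(cudaTextureObject_t tex, float *out,
                              int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Offsets of +0.5f hit texel centers; fractional coordinates interpolate.
        out[y * width + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }
}

cudaTextureObject_t make_half_texture(const unsigned short *h_data,
                                      int width, int height, cudaArray_t *arr)
{
    // 16-bit float channel format (same as cudaCreateChannelDescHalf()).
    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(16, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaMallocArray(arr, &desc, width, height);
    cudaMemcpy2DToArray(*arr, 0, 0, h_data,
                        width * sizeof(unsigned short),
                        width * sizeof(unsigned short), height,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = *arr;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0]   = cudaAddressModeClamp;
    texDesc.addressMode[1]   = cudaAddressModeClamp;
    texDesc.filterMode       = cudaFilterModeLinear;    // hardware interpolation
    texDesc.readMode         = cudaReadModeElementType; // FP16 -> FP32 on fetch
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}
[/code]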

Please note that FP16 arithmetic, if and when it will be supported in CUDA (presumably in the Pascal time frame, as txbob notes), will be accurate to only 3 decimal digits. This means there is only minimal tolerance to accumulated round-off error, assuming that practical applications will likely need final results accurate to at least 8 bits.
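
A quick way to see what roughly three decimal digits means in practice (using the conversion intrinsics discussed above; the values are arbitrary examples):

[code]
// FP16 has an 11-bit significand, so round-off shows up quickly.
__global__ void fp16_roundoff_demo(float *out)
{
    out[0] = __half2float(__float2half_rn(2049.0f)); // stored as 2048.0
    out[1] = __half2float(__float2half_rn(1.001f));  // stored as ~1.0009766
}
[/code]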