Hi, wanted to know if the ‘half’ 16 bit floating point type available in CUDA conforms tothe IEEE 754 specification of ‘binary16’. See Half-precision floating-point format - Wikipedia for more info to the IEEE 754 16-bit floating point format.
And what about this topic here?
MK
the posting at the mentioned link do not give information wehther the half datatype is IEEE 754 conforming or not.
Although a little indirect, the CUDA C Programming Guide does mention in the section on 16-bit floating point textures that the data type is “the same as the IEEE 754-2008 binary2 format”.
Also, according to the PTX manual, all half-float values (which they call f16 in the PTX manual) have to be upconverted to single or double precision floats before arithmetic can be performed on them, so the only relevant standards question is that of bit layout.
While CUDA supports no arithmetic operations on half floats, the question is somewhat broader then one of bit layout, as it extends to the semantics of encodings. For example, is there support for denormals? Unfortunately I do not know the answer to that question, not having used half floats.
As noted in the CUDA C Programming Guide, the bit layout of ‘half’ operands on the GPU is identical to the 16-bit floating-point format specified by IEEE-754:2008.
As mentioned, CUDA does not provide any arithmetic operation for ‘half’ operands, just conversions to and from float. I checked out the conversion machine instructions, F2F.F32.F16 and F2F.F16.F32, which are exposed as device functions __half2float() and __float2half_rn() in CUDA. ‘half’ operands must be stored as unsigned short operands, since CUDA lacks a dedicated floating-point type ‘half’.
The GPU ‘half’ format has denormal support, and underflow to denormal or zero during float-to-‘half’ conversion works as required by IEEE-754. During float-to-‘half’ conversion all float NaN encodings are mapped to a single canonical ‘half’ NaN, 0x7FFF. During ‘half’-to-float conversion all ‘half’ NaN encodings are mapped to a single canonical float NaN, 0x7FFFFFFF. The use of canonical NaNs is compliant with IEEE-754. Infinities are mapped to equivalent encodings during conversion in either direction and overflow to infinity during float->‘half’ conversion works as required by IEEE-754.
I conclude that in as far as operations on ‘half’ are provided, they are in compliance with the IEEE-754:2008 specification for a 16-bit floating-point type.
I have not had any luck yet getting a 16-bit floating-point texture to work from the CUDA runtime (I never use the driver API). I am still digging on that. According to the CUDA C Programming Guide there is support in the CUDA runtime API for textures with ‘half’ elements, with the data getting converted to float on access.
Thanx to all (especially njuffa) for the great update ! We plan to use ‘half’ only in order to save memory bandwidth , and also in order to reduce memory usage (we are dealing with - sometimes very big - images). The arithmetics on these values will be done always in ‘float’. Texture read support from a ‘half’ image would be something i suppose would be really helpful for us to have available in the cuda runtime api (or driver api - doesn’t matter as one can mix runtime and driver api afaik) - so would be nice if we get here some information from ‘njuffa’ or someone else from NVIDIA, thx in advance …).
The missing piece of the puzzle with regard to 16-bit ‘half’ data for textures was that the required CUDA runtime API functions are missing from the documentation. A bug has been filed. Here is a minimal demo app showing how to access 16-bit ‘half’ textures:
#include <stdlib.h>
#include <stdio.h>
// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call) \
do { \
cudaError_t err = call; \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString(err) ); \
exit(EXIT_FAILURE); \
} \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR() \
do { \
/* Check synchronous errors, i.e. pre-launch */ \
cudaError_t err = cudaGetLastError(); \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString(err) ); \
exit(EXIT_FAILURE); \
} \
/* Check asynchronous errors, i.e. kernel failed (ULF) */ \
err = cudaThreadSynchronize(); \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString( err) ); \
exit(EXIT_FAILURE); \
} \
} while (0)
texture<float, 2> tex;
__global__ void kernel (int m, int n)
{
float val;
for (int row = 0; row < m; row++) {
for (int col = 0; col < n; col++) {
val = tex2D (tex, col + 0.5f, row + 0.5f);
printf ("% 15.8e ", val);
}
printf ("\n");
}
}
int main (void)
{
int m = 4; // height = #rows
int n = 3; // width = #columns
size_t pitch, tex_ofs;
unsigned short arr[4][3]= {{0x0000,0x0001,0x0002}, // zero, denormals
{0x3c00,0x3c01,0x3c02}, // 1.0 + eps
{0x4000,0x4001,0x4002}, // 2.0 + eps
{0x7c00,0x7c01,0x7c02}}; // infinity, NaNs
unsigned short *arr_d = 0;
cudaChannelFormatDesc channelDesc = cudaCreateChannelDescHalf();
CUDA_SAFE_CALL(cudaMallocPitch((void**)&arr_d,&pitch,n*sizeof(*arr_d),m));
CUDA_SAFE_CALL(cudaMemcpy2D(arr_d, pitch, arr, n*sizeof(arr[0][0]),
n*sizeof(arr[0][0]),m,cudaMemcpyHostToDevice));
CUDA_SAFE_CALL (cudaBindTexture2D (&tex_ofs, &tex, arr_d, &channelDesc,
n, m, pitch));
if (tex_ofs !=0) {
printf ("tex_ofs = %zu\n", tex_ofs);
return EXIT_FAILURE;
}
kernel<<<1,1>>>(m, n);
CHECK_LAUNCH_ERROR();
CUDA_SAFE_CALL (cudaDeviceSynchronize());
CUDA_SAFE_CALL (cudaFree (arr_d));
return EXIT_SUCCESS;
}
The output of this app should be as follows (please note that how infinities and NaNs are printed is host-system specific):
0.00000000e+000 5.96046448e-008 1.19209290e-007
1.00000000e+000 1.00097656e+000 1.00195313e+000
2.00000000e+000 2.00195313e+000 2.00390625e+000
1.#INF0000e+000 1.#QNAN000e+000 1.#QNAN000e+000
Thx, njuffa - so looks like we have all in place to use ‘half’ in our upcoming GPU implementation (stereo matcher). As usual, this forum is a great place to get some information and help for us!
are the template arguments for the texture reference “tex” supposed to be
unsigned short, 2, cudaReadModeNormalizedFloat ?
The forum code has cut out the template arguments, probably to prevent malicious HTML tag injection
I have trouble getting this to run on CUDA 7.5 RC (Windows 7, GTX 660Ti). Maybe someone can verify?
cudaBindTexture2D returns an invalid argument error.
Nevermind, got it to run with
float, 2
as template arguments.
I might try to rewrite this example to use a texture object instead.
Christian
In my hgemm and hconv kernels I’ve been avoiding the texture loads. The F2F calls are dual issued with compute instructions and I’ve had no problems hiding their latencies (~16 clocks). Conversion throughput is a quarter that of compute.
Another useful feature on sm_52 that’s supported in the graphics API but not yet in cuda is atomics on half2 values.
RED.E.ADD.F16x2.FTZ.RN (global reduce)
ATOM.E.ADD.F16x2.FTZ.RN (global atomic add)
ATOMS.ADD.F16x2.FTZ.RN (shared atomic add (I think this exists))
I’ve made good use of these in my convolution kernels. Now that there is an actual half2 type in cuda it’s probably time they expose those instructions.
Mark Harris also has a little writup on half support:
http://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/
Argh. The missing texture parameters seem more like a bug in the forum software. Note how the #includes have likewise been mutilated. Use of angled brackets inside a “code” block worked just fine before.
@cbuchner: We don’t use texture references, I don’t like them. We use only texture objects (CC >= 3.0).
channel format descriptor for 16-bit half looks like this for us:
cudaCreateChannelDesc(16, 0, 0, 0, cudaChannelFormatKindUnsigned)
We don’t use bilinear interpolation on ‘half’ images (filterMode is ‘cudaFilterModePoint’), and I am not sure whether that works.
After loading a 16-bit vvalue as ‘short’, we ‘unwrap’ it to a 32-bit float with function ‘__half2float’.
I can see from latest cuda releases that the topic of 16-bit floats (both as storage format and also for computations) got a lot attentation, thats good. We do all computations in 32-bit float, to be on safe side, so we ‘only’ get the benefits from the memory bandwidth we save.
@HannesF99: I prefer performing direct floating point access to the half float texture.
I’ve already tested bilinear interpolation. It works just fine in the above sample code when substituting row + 0.5f against row + 1.0f. You’ll see values interpolated between the 1.0+epsilon and 2.0+epsilon values, for example.
I’ve modified the texture reference to specify the filter and clamping mode in its constructor.
texture< float, 2 > tex(0, cudaFilterModeLinear, cudaAddressModeClamp, cudaCreateChannelDescHalf());
We will probably use layered cudaArrays to store complex valued MIMO channel matrices of size 8x8 and 16x16. One dimension is frequency, the other dimension is time. The layer specifies the matrix element. It’s comes in very handy that bilinear interpolation across frequency and time is possible.
For the complex valued data one can use a texture reference like this, where the x and y components contain real and imaginary parts.
texture< float2, 2 > tex(0, cudaFilterModeLinear, cudaAddressModeClamp, cudaCreateChannelDescHalf2());
By the way: If someone needs a bit-identical 16-bit float equiavelent on the CPU, there is a ‘half_float’ class at http://half.sourceforge.net/
It is slow, but useful for implementing the CPU reference code against which the GPU routines is compared against. We usually first implement a CPU reference code and then the optimized GPU implementation.
For those interested, I’ve updated the code sample to use texture objects and to store 2-element vectors (e.g. complex numbers).
I’ve also turned on bilinear interpolation, so you can interpolate across the elements of the array. Test by changing the coordinates of the texture access e.g.
val = tex2D<float2>(tex, col + 0.5f, row + 1.0f);
EDIT: seems I’ve forgot to properly dispose of the texture object before exiting. duh.
// NOTE: this example has been converted to the CUDA texture objects API.
// Compute Capability 3.0 (Kepler devices or later) are a requirement.
//
// Source: https://devtalk.nvidia.com/default/topic/547080/-half-datatype-ieee-754-conformance/
//
// The output of this app should be as follows (please note that how infinities and NaNs are printed is host - system specific) :
//
// 0.00000000 + i* 0.00000000 0.00000006 + i* 0.00000006 0.00000012 + i* 0.00000012
// 1.00000000 + i* 1.00000000 1.00097656 + i* 1.00097656 1.00195313 + i* 1.00195313
// 2.00000000 + i* 2.00000000 2.00195313 + i* 2.00195313 2.00390625 + i* 2.00390625
// 1.#INF0000 + i* 1.#INF0000 1.#QNAN000 + i* 1.#QNAN000 1.#QNAN000 + i* 1.#QNAN000
#include <cuda_runtime.h>
#include "device_launch_parameters.h"
#include <stdio.h>
#include <memory.h>
// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call) \
do { \
cudaError_t err = call; \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString(err) ); \
exit(EXIT_FAILURE); \
} \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR() \
do { \
/* Check synchronous errors, i.e. pre-launch */ \
cudaError_t err = cudaGetLastError(); \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString(err) ); \
exit(EXIT_FAILURE); \
} \
/* Check asynchronous errors, i.e. kernel failed (ULF) */ \
err = cudaThreadSynchronize(); \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString( err) ); \
exit(EXIT_FAILURE); \
} \
} while (0)
__global__ void kernel(cudaTextureObject_t tex, int m, int n)
{
float2 val;
for (int row = 0; row < m; row++) {
for (int col = 0; col < n; col++) {
val = tex2D<float2>(tex, col + 0.5f, row + 0.5f);
printf("% 9.8f+i*% 9.8f ", val.x, val.y);
}
printf("\n");
}
}
int main(void)
{
int m = 4; // height = #rows
int n = 3; // width = #columns
size_t pitch;
ushort2 arr[4][3] = {
{ make_ushort2(0x0000, 0x0000), make_ushort2(0x0001, 0x0001), make_ushort2(0x0002, 0x0002) }, // zero, denormals
{ make_ushort2(0x3c00, 0x3c00), make_ushort2(0x3c01, 0x3c01), make_ushort2(0x3c02, 0x3c02) }, // 1.0 + eps
{ make_ushort2(0x4000, 0x4000), make_ushort2(0x4001, 0x4001), make_ushort2(0x4002, 0x4002) }, // 2.0 + eps
{ make_ushort2(0x7c00, 0x7c00), make_ushort2(0x7c01, 0x7c01), make_ushort2(0x7c02, 0x7c02) } }; // infinity, NaNs
unsigned short *arr_d = 0;
CUDA_SAFE_CALL(cudaMallocPitch((void**)&arr_d, &pitch, n*sizeof(*arr_d), m));
CUDA_SAFE_CALL(cudaMemcpy2D(arr_d, pitch, arr, n*sizeof(arr[0][0]), n*sizeof(arr[0][0]), m, cudaMemcpyHostToDevice));
// create resource description
cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypePitch2D;
resDesc.res.pitch2D.desc = cudaCreateChannelDescHalf2();
resDesc.res.pitch2D.devPtr = arr_d;
resDesc.res.pitch2D.pitchInBytes = pitch;
resDesc.res.pitch2D.width = n;
resDesc.res.pitch2D.height = m;
// create texture description
cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;
texDesc.filterMode = cudaFilterModeLinear;
texDesc.addressMode[0] = cudaAddressModeClamp;
texDesc.addressMode[1] = cudaAddressModeClamp;
// create texture object
cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
kernel << <1, 1 >> >(tex, m, n);
CHECK_LAUNCH_ERROR();
CUDA_SAFE_CALL(cudaDeviceSynchronize());
CUDA_SAFE_CALL(cudaFree(arr_d));
return EXIT_SUCCESS;
}
The next challenge for me to tacke will be using layered CUDA arrays, so we can store multiple matrix coefficients, and do interpolation of entire matrices over both frequency (x axis) and time (y axis).
How was the CPU code for ‘half’ at SourceForge validated?
Last code example for today.
Layered CUDA Arrays instead of 2D pitched memory as in the previous example.
// NOTE: this example has been converted to the CUDA texture objects API.
// Compute Capability 3.0 (Kepler devices or later) are a requirement.
//
// Source: https://devtalk.nvidia.com/default/topic/547080/-half-datatype-ieee-754-conformance/
//
// The output of this app should be as follows (please note that how infinities and NaNs are printed is host - system specific) :
//
// 0.00000000 + i* 0.00000000 0.00000006 + i* 0.00000006 0.00000012 + i* 0.00000012
// 1.00000000 + i* 1.00000000 1.00097656 + i* 1.00097656 1.00195313 + i* 1.00195313
// 2.00000000 + i* 2.00000000 2.00195313 + i* 2.00195313 2.00390625 + i* 2.00390625
// 1.#INF0000 + i* 1.#INF0000 1.#QNAN000 + i* 1.#QNAN000 1.#QNAN000 + i* 1.#QNAN000
//
// 1.#INF0000 + i* 1.#INF0000 1.#QNAN000 + i* 1.#QNAN000 1.#QNAN000 + i* 1.#QNAN000
// 0.00000000 + i* 0.00000000 0.00000006 + i* 0.00000006 0.00000012 + i* 0.00000012
// 1.00000000 + i* 1.00000000 1.00097656 + i* 1.00097656 1.00195313 + i* 1.00195313
// 2.00000000 + i* 2.00000000 2.00195313 + i* 2.00195313 2.00390625 + i* 2.00390625
#include <cuda_runtime.h>
#include "device_launch_parameters.h"
#include <stdio.h>
#include <memory.h>
// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call) \
do { \
cudaError_t err = call; \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString(err) ); \
exit(EXIT_FAILURE); \
} \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR() \
do { \
/* Check synchronous errors, i.e. pre-launch */ \
cudaError_t err = cudaGetLastError(); \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString(err) ); \
exit(EXIT_FAILURE); \
} \
/* Check asynchronous errors, i.e. kernel failed (ULF) */ \
err = cudaThreadSynchronize(); \
if (cudaSuccess != err) { \
fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
__FILE__, __LINE__, cudaGetErrorString( err) ); \
exit(EXIT_FAILURE); \
} \
} while (0)
__global__ void kernel(cudaTextureObject_t tex, int m, int n, int o)
{
float2 val;
for (int layer = 0; layer < o; layer++) {
for (int row = 0; row < m; row++) {
for (int col = 0; col < n; col++) {
val = tex2DLayered<float2>(tex, col + 0.5f, row + 0.5f, layer);
printf("% 9.8f+i*% 9.8f ", val.x, val.y);
}
printf("\n");
}
printf("\n");
}
}
int main(void)
{
int m = 4; // height = #rows
int n = 3; // width = #columns
int o = 2; // depth
ushort2 arr[2][4][3] = {
{ { make_ushort2(0x0000, 0x0000), make_ushort2(0x0001, 0x0001), make_ushort2(0x0002, 0x0002) }, // zero, denormals
{ make_ushort2(0x3c00, 0x3c00), make_ushort2(0x3c01, 0x3c01), make_ushort2(0x3c02, 0x3c02) }, // 1.0 + eps
{ make_ushort2(0x4000, 0x4000), make_ushort2(0x4001, 0x4001), make_ushort2(0x4002, 0x4002) }, // 2.0 + eps
{ make_ushort2(0x7c00, 0x7c00), make_ushort2(0x7c01, 0x7c01), make_ushort2(0x7c02, 0x7c02) } }, // infinity, NaNs
{ { make_ushort2(0x7c00, 0x7c00), make_ushort2(0x7c01, 0x7c01), make_ushort2(0x7c02, 0x7c02) }, // infinity, NaNs
{ make_ushort2(0x3c00, 0x3c00), make_ushort2(0x3c01, 0x3c01), make_ushort2(0x3c02, 0x3c02) }, // 1.0 + eps
{ make_ushort2(0x4000, 0x4000), make_ushort2(0x4001, 0x4001), make_ushort2(0x4002, 0x4002) }, // 2.0 + eps
{ make_ushort2(0x0000, 0x0000), make_ushort2(0x0001, 0x0001), make_ushort2(0x0002, 0x0002) } }, // zero, denormals
};
cudaExtent extent;
extent.width = n;
extent.height = m;
extent.depth = o;
cudaArray_t array;
cudaChannelFormatDesc channelDesc = cudaCreateChannelDescHalf2();
CUDA_SAFE_CALL(cudaMalloc3DArray(&array, &channelDesc, extent, cudaArrayLayered));
cudaMemcpy3DParms parms;
memset(&parms, 0, sizeof(parms));
parms.extent = make_cudaExtent(n, m, o);
parms.srcPos = make_cudaPos(0, 0, 0);
parms.srcPtr = make_cudaPitchedPtr(arr, n*sizeof(arr[0][0][0]), n, m);
parms.dstArray = array;
parms.dstPos = make_cudaPos(0, 0, 0);
parms.kind = cudaMemcpyHostToDevice;
CUDA_SAFE_CALL(cudaMemcpy3D(&parms));
// create resource description
cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypeArray;
resDesc.res.array.array = array;
// create texture description
cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;
texDesc.filterMode = cudaFilterModeLinear;
texDesc.addressMode[0] = cudaAddressModeClamp;
texDesc.addressMode[1] = cudaAddressModeClamp;
texDesc.addressMode[2] = cudaAddressModeClamp;
// create texture object
cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
kernel << <1, 1 >> >(tex, m, n, o);
CHECK_LAUNCH_ERROR();
CUDA_SAFE_CALL(cudaDeviceSynchronize());
CUDA_SAFE_CALL(cudaFreeArray(array));
CUDA_SAFE_CALL(cudaDestroyTextureObject(tex));
return EXIT_SUCCESS;
}
@njuffa: We ‘validated’ it by setting the image pixel values on the CPU (using the ‘half_float’ class), ttransferring to GPU (with simple cudaMemcopy), manipulating the pixel values in a defined way with a GPU kernel (e.g., adding 3 to every intensity value), transferring it back to CPU, and reading out the values.
So for me, it seems that the CPU class and the GPU ‘near-native’ type are bit-identical, at least for windows platform. Makes sense as the bit representation of ‘half’ is exactly defined by the IEEE standard (at least i think so).