'half' datatype - IEEE 754 conformance

Hi, wanted to know if the ‘half’ 16 bit floating point type available in CUDA conforms to the IEEE 754 specification of ‘binary16’. See https://en.wikipedia.org/wiki/Half-precision_floating-point_format for more info on the IEEE 754 16-bit floating point format.

And what about this topic here?

MK

The posting at the mentioned link does not say whether the half datatype is IEEE 754 conforming or not.

Although a little indirect, the CUDA C Programming Guide does mention in the section on 16-bit floating point textures that the data type is “the same as the IEEE 754-2008 binary16 format”.

Also, according to the PTX manual, all half-float values (which they call f16 in the PTX manual) have to be upconverted to single or double precision floats before arithmetic can be performed on them, so the only relevant standards question is that of bit layout.

While CUDA supports no arithmetic operations on half floats, the question is somewhat broader than one of bit layout, as it extends to the semantics of encodings. For example, is there support for denormals? Unfortunately I do not know the answer to that question, not having used half floats.

As noted in the CUDA C Programming Guide, the bit layout of ‘half’ operands on the GPU is identical to the 16-bit floating-point format specified by IEEE-754:2008.

As mentioned, CUDA does not provide any arithmetic operation for ‘half’ operands, just conversions to and from float. I checked out the conversion machine instructions, F2F.F32.F16 and F2F.F16.F32, which are exposed as device functions __half2float() and __float2half_rn() in CUDA. ‘half’ operands must be stored as unsigned short operands, since CUDA lacks a dedicated floating-point type ‘half’.

The GPU ‘half’ format has denormal support, and underflow to denormal or zero during float-to-‘half’ conversion works as required by IEEE-754. During float-to-‘half’ conversion all float NaN encodings are mapped to a single canonical ‘half’ NaN, 0x7FFF. During ‘half’-to-float conversion all ‘half’ NaN encodings are mapped to a single canonical float NaN, 0x7FFFFFFF. The use of canonical NaNs is compliant with IEEE-754. Infinities are mapped to equivalent encodings during conversion in either direction and overflow to infinity during float->‘half’ conversion works as required by IEEE-754.

I conclude that, insofar as operations on ‘half’ are provided, they are in compliance with the IEEE-754:2008 specification for a 16-bit floating-point type.

I have not had any luck yet getting a 16-bit floating-point texture to work from the CUDA runtime (I never use the driver API). I am still digging on that. According to the CUDA C Programming Guide there is support in the CUDA runtime API for textures with ‘half’ elements, with the data getting converted to float on access.

Thanx to all (especially njuffa) for the great update! We plan to use ‘half’ only in order to save memory bandwidth, and also in order to reduce memory usage (we are dealing with - sometimes very big - images). Arithmetic on these values will always be done in ‘float’. Texture read support from a ‘half’ image is something I suppose would be really helpful for us to have available in the CUDA runtime API (or driver API - doesn’t matter, as one can mix runtime and driver API afaik) - so it would be nice to get some information here from ‘njuffa’ or someone else from NVIDIA, thx in advance …

The missing piece of the puzzle with regard to 16-bit ‘half’ data for textures was that the required CUDA runtime API functions are missing from the documentation. A bug has been filed. Here is a minimal demo app showing how to access 16-bit ‘half’ textures:

#include <stdlib.h>
#include <stdio.h>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

texture<float, 2> tex;

__global__ void kernel (int m, int n) 
{
    float val;
    for (int row = 0; row < m; row++) {
        for (int col = 0; col < n; col++) {
            val = tex2D (tex, col + 0.5f, row + 0.5f);
            printf ("% 15.8e   ", val);
        }
        printf ("\n");
    }
}

int main (void)
{
    int m = 4; // height = #rows
    int n = 3; // width  = #columns
    size_t pitch, tex_ofs;
    unsigned short arr[4][3]= {{0x0000,0x0001,0x0002},  // zero, denormals
                               {0x3c00,0x3c01,0x3c02},  // 1.0 + eps
                               {0x4000,0x4001,0x4002},  // 2.0 + eps
                               {0x7c00,0x7c01,0x7c02}}; // infinity, NaNs
    unsigned short *arr_d = 0;
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDescHalf(); 
    CUDA_SAFE_CALL(cudaMallocPitch((void**)&arr_d,&pitch,n*sizeof(*arr_d),m));
    CUDA_SAFE_CALL(cudaMemcpy2D(arr_d, pitch, arr, n*sizeof(arr[0][0]),
                                n*sizeof(arr[0][0]),m,cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL (cudaBindTexture2D (&tex_ofs, &tex, arr_d, &channelDesc,
                                       n, m, pitch));
    if (tex_ofs !=0) {
        printf ("tex_ofs = %zu\n", tex_ofs);
        return EXIT_FAILURE;
    }
    kernel<<<1,1>>>(m, n);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    CUDA_SAFE_CALL (cudaFree (arr_d));
    return EXIT_SUCCESS;
}

The output of this app should be as follows (please note that how infinities and NaNs are printed is host-system specific):

0.00000000e+000 5.96046448e-008 1.19209290e-007
1.00000000e+000 1.00097656e+000 1.00195313e+000
2.00000000e+000 2.00195313e+000 2.00390625e+000
1.#INF0000e+000 1.#QNAN000e+000 1.#QNAN000e+000

Thx, njuffa - so looks like we have all in place to use ‘half’ in our upcoming GPU implementation (stereo matcher). As usual, this forum is a great place to get some information and help for us!

are the template arguments for the texture reference “tex” supposed to be

unsigned short, 2, cudaReadModeNormalizedFloat ?

The forum software has cut out the template arguments, probably to prevent malicious HTML tag injection.

I have trouble getting this to run on CUDA 7.5 RC (Windows 7, GTX 660Ti). Maybe someone can verify?

cudaBindTexture2D returns an invalid argument error.

Nevermind, got it to run with

float, 2

as template arguments.

I might try to rewrite this example to use a texture object instead.

Christian

In my hgemm and hconv kernels I’ve been avoiding the texture loads. The F2F calls are dual issued with compute instructions and I’ve had no problems hiding their latencies (~16 clocks). Conversion throughput is a quarter that of compute.

Another useful feature on sm_52 that’s supported in the graphics API but not yet in cuda is atomics on half2 values.

RED.E.ADD.F16x2.FTZ.RN (global reduce)
ATOM.E.ADD.F16x2.FTZ.RN (global atomic add)
ATOMS.ADD.F16x2.FTZ.RN (shared atomic add (I think this exists))

I’ve made good use of these in my convolution kernels. Now that there is an actual half2 type in cuda it’s probably time they expose those instructions.

Mark Harris also has a little write-up on half support:

Argh. The missing texture parameters seem more like a bug in the forum software. Note how the #includes have likewise been mutilated. Use of angled brackets inside a “code” block worked just fine before.

@cbuchner: We don’t use texture references, I don’t like them. We use only texture objects (CC >= 3.0).

The channel format descriptor for 16-bit half looks like this for us:
cudaCreateChannelDesc(16, 0, 0, 0, cudaChannelFormatKindUnsigned)

We don’t use bilinear interpolation on ‘half’ images (filterMode is ‘cudaFilterModePoint’), and I am not sure whether that works.

After loading a 16-bit value as ‘short’, we ‘unwrap’ it to a 32-bit float with the function ‘__half2float’.

I can see from the latest CUDA releases that the topic of 16-bit floats (both as a storage format and for computations) has gotten a lot of attention, that’s good. We do all computations in 32-bit float, to be on the safe side, so we ‘only’ get the benefit of the memory bandwidth we save.

@HannesF99: I prefer performing direct floating point access to the half float texture.

I’ve already tested bilinear interpolation. It works just fine in the above sample code when substituting row + 1.0f for row + 0.5f. You’ll see values interpolated between the 1.0+epsilon and 2.0+epsilon values, for example.

I’ve modified the texture reference to specify the filter and clamping mode in its constructor.

texture< float, 2 > tex(0, cudaFilterModeLinear, cudaAddressModeClamp, cudaCreateChannelDescHalf());

We will probably use layered cudaArrays to store complex-valued MIMO channel matrices of size 8x8 and 16x16. One dimension is frequency, the other dimension is time. The layer specifies the matrix element. It comes in very handy that bilinear interpolation across frequency and time is possible.

For the complex valued data one can use a texture reference like this, where the x and y components contain real and imaginary parts.

texture< float2, 2 > tex(0, cudaFilterModeLinear, cudaAddressModeClamp, cudaCreateChannelDescHalf2());

By the way: if someone needs a bit-identical 16-bit float equivalent on the CPU, there is a ‘half_float’ class at http://half.sourceforge.net/

It is slow, but useful for implementing the CPU reference code against which the GPU routines are compared. We usually first implement a CPU reference code and then the optimized GPU implementation.

For those interested, I’ve updated the code sample to use texture objects and to store 2-element vectors (e.g. complex numbers).

I’ve also turned on bilinear interpolation, so you can interpolate across the elements of the array. Test by changing the coordinates of the texture access e.g.

val = tex2D<float2>(tex, col + 0.5f, row + 1.0f);

EDIT: seems I’ve forgot to properly dispose of the texture object before exiting. duh.

// NOTE: this example has been converted to the CUDA texture objects API.
//       Compute Capability 3.0 (Kepler devices or later) are a requirement.
//
// Source: https://devtalk.nvidia.com/default/topic/547080/-half-datatype-ieee-754-conformance/
//
// The output of this app should be as follows (please note that how infinities and NaNs are printed is host-system specific):
//
// 0.00000000 + i* 0.00000000  0.00000006 + i* 0.00000006  0.00000012 + i* 0.00000012
// 1.00000000 + i* 1.00000000  1.00097656 + i* 1.00097656  1.00195313 + i* 1.00195313
// 2.00000000 + i* 2.00000000  2.00195313 + i* 2.00195313  2.00390625 + i* 2.00390625
// 1.#INF0000 + i* 1.#INF0000  1.#QNAN000 + i* 1.#QNAN000  1.#QNAN000 + i* 1.#QNAN000

#include <cuda_runtime.h>
#include "device_launch_parameters.h"

#include <stdio.h>
#include <memory.h>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
    do {                                                                  \
        cudaError_t err = call;                                           \
        if (cudaSuccess != err) {                                         \
            fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                     __FILE__, __LINE__, cudaGetErrorString(err) );       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
    do {                                                                  \
        /* Check synchronous errors, i.e. pre-launch */                   \
        cudaError_t err = cudaGetLastError();                             \
        if (cudaSuccess != err) {                                         \
            fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                     __FILE__, __LINE__, cudaGetErrorString(err) );       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
        /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
        err = cudaDeviceSynchronize();                                    \
        if (cudaSuccess != err) {                                         \
            fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                     __FILE__, __LINE__, cudaGetErrorString( err) );      \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void kernel(cudaTextureObject_t tex, int m, int n)
{
	float2 val;
	for (int row = 0; row < m; row++) {
		for (int col = 0; col < n; col++) {
			val = tex2D<float2>(tex, col + 0.5f, row + 0.5f);
			printf("% 9.8f+i*% 9.8f ", val.x, val.y);
		}
		printf("\n");
	}
}

int main(void)
{
	int m = 4; // height = #rows
	int n = 3; // width  = #columns
	size_t pitch;
	ushort2 arr[4][3] = {
		{ make_ushort2(0x0000, 0x0000), make_ushort2(0x0001, 0x0001), make_ushort2(0x0002, 0x0002) },  // zero, denormals
		{ make_ushort2(0x3c00, 0x3c00), make_ushort2(0x3c01, 0x3c01), make_ushort2(0x3c02, 0x3c02) },  // 1.0 + eps
		{ make_ushort2(0x4000, 0x4000), make_ushort2(0x4001, 0x4001), make_ushort2(0x4002, 0x4002) },  // 2.0 + eps
		{ make_ushort2(0x7c00, 0x7c00), make_ushort2(0x7c01, 0x7c01), make_ushort2(0x7c02, 0x7c02) } }; // infinity, NaNs
	unsigned short *arr_d = 0;
	CUDA_SAFE_CALL(cudaMallocPitch((void**)&arr_d, &pitch, n*sizeof(*arr_d), m));
	CUDA_SAFE_CALL(cudaMemcpy2D(arr_d, pitch, arr, n*sizeof(arr[0][0]), n*sizeof(arr[0][0]), m, cudaMemcpyHostToDevice));

	// create resource description
	cudaResourceDesc resDesc;
	memset(&resDesc, 0, sizeof(resDesc));
	resDesc.resType = cudaResourceTypePitch2D;
	resDesc.res.pitch2D.desc = cudaCreateChannelDescHalf2();
	resDesc.res.pitch2D.devPtr = arr_d;
	resDesc.res.pitch2D.pitchInBytes = pitch;
	resDesc.res.pitch2D.width = n;
	resDesc.res.pitch2D.height = m;

	// create texture description
	cudaTextureDesc texDesc;
	memset(&texDesc, 0, sizeof(texDesc));
	texDesc.readMode = cudaReadModeElementType;
	texDesc.filterMode = cudaFilterModeLinear;
	texDesc.addressMode[0] = cudaAddressModeClamp;
	texDesc.addressMode[1] = cudaAddressModeClamp;

	// create texture object
	cudaTextureObject_t tex = 0;
	CUDA_SAFE_CALL(cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL));

	kernel << <1, 1 >> >(tex, m, n);
	CHECK_LAUNCH_ERROR();
	CUDA_SAFE_CALL(cudaDeviceSynchronize());
	CUDA_SAFE_CALL(cudaDestroyTextureObject(tex));
	CUDA_SAFE_CALL(cudaFree(arr_d));
	return EXIT_SUCCESS;
}

The next challenge for me to tackle will be using layered CUDA arrays, so we can store multiple matrix coefficients and do interpolation of entire matrices over both frequency (x axis) and time (y axis).

How was the CPU code for ‘half’ at SourceForge validated?

Last code example for today.

Layered CUDA Arrays instead of 2D pitched memory as in the previous example.

// NOTE: this example has been converted to the CUDA texture objects API.
//       Compute Capability 3.0 (Kepler devices or later) are a requirement.
//
// Source: https://devtalk.nvidia.com/default/topic/547080/-half-datatype-ieee-754-conformance/
//
// The output of this app should be as follows (please note that how infinities and NaNs are printed is host-system specific):
//
// 0.00000000 + i* 0.00000000  0.00000006 + i* 0.00000006  0.00000012 + i* 0.00000012
// 1.00000000 + i* 1.00000000  1.00097656 + i* 1.00097656  1.00195313 + i* 1.00195313
// 2.00000000 + i* 2.00000000  2.00195313 + i* 2.00195313  2.00390625 + i* 2.00390625
// 1.#INF0000 + i* 1.#INF0000  1.#QNAN000 + i* 1.#QNAN000  1.#QNAN000 + i* 1.#QNAN000
//
// 1.#INF0000 + i* 1.#INF0000  1.#QNAN000 + i* 1.#QNAN000  1.#QNAN000 + i* 1.#QNAN000
// 0.00000000 + i* 0.00000000  0.00000006 + i* 0.00000006  0.00000012 + i* 0.00000012
// 1.00000000 + i* 1.00000000  1.00097656 + i* 1.00097656  1.00195313 + i* 1.00195313
// 2.00000000 + i* 2.00000000  2.00195313 + i* 2.00195313  2.00390625 + i* 2.00390625

#include <cuda_runtime.h>
#include "device_launch_parameters.h"

#include <stdio.h>
#include <memory.h>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
    do {                                                                  \
        cudaError_t err = call;                                           \
        if (cudaSuccess != err) {                                         \
            fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                     __FILE__, __LINE__, cudaGetErrorString(err) );       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
    do {                                                                  \
        /* Check synchronous errors, i.e. pre-launch */                   \
        cudaError_t err = cudaGetLastError();                             \
        if (cudaSuccess != err) {                                         \
            fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                     __FILE__, __LINE__, cudaGetErrorString(err) );       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
        /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
        err = cudaDeviceSynchronize();                                    \
        if (cudaSuccess != err) {                                         \
            fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                     __FILE__, __LINE__, cudaGetErrorString( err) );      \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)


__global__ void kernel(cudaTextureObject_t tex, int m, int n, int o)
{
	float2 val;
	for (int layer = 0; layer < o; layer++) {
		for (int row = 0; row < m; row++) {
			for (int col = 0; col < n; col++) {
				val = tex2DLayered<float2>(tex, col + 0.5f, row + 0.5f, layer);
				printf("% 9.8f+i*% 9.8f ", val.x, val.y);
			}
			printf("\n");
		}
		printf("\n");
	}
}

int main(void)
{
	int m = 4; // height = #rows
	int n = 3; // width  = #columns
	int o = 2; // depth

	ushort2 arr[2][4][3] = {
		{ { make_ushort2(0x0000, 0x0000), make_ushort2(0x0001, 0x0001), make_ushort2(0x0002, 0x0002) },   // zero, denormals
		  { make_ushort2(0x3c00, 0x3c00), make_ushort2(0x3c01, 0x3c01), make_ushort2(0x3c02, 0x3c02) },   // 1.0 + eps
		  { make_ushort2(0x4000, 0x4000), make_ushort2(0x4001, 0x4001), make_ushort2(0x4002, 0x4002) },   // 2.0 + eps
		  { make_ushort2(0x7c00, 0x7c00), make_ushort2(0x7c01, 0x7c01), make_ushort2(0x7c02, 0x7c02) } }, // infinity, NaNs

	    { { make_ushort2(0x7c00, 0x7c00), make_ushort2(0x7c01, 0x7c01), make_ushort2(0x7c02, 0x7c02) },   // infinity, NaNs
		  { make_ushort2(0x3c00, 0x3c00), make_ushort2(0x3c01, 0x3c01), make_ushort2(0x3c02, 0x3c02) },   // 1.0 + eps
		  { make_ushort2(0x4000, 0x4000), make_ushort2(0x4001, 0x4001), make_ushort2(0x4002, 0x4002) },   // 2.0 + eps
		  { make_ushort2(0x0000, 0x0000), make_ushort2(0x0001, 0x0001), make_ushort2(0x0002, 0x0002) } }, // zero, denormals
	  };

	cudaExtent extent;
	extent.width = n;
	extent.height = m;
	extent.depth = o;

	cudaArray_t array;
	cudaChannelFormatDesc channelDesc = cudaCreateChannelDescHalf2();
	CUDA_SAFE_CALL(cudaMalloc3DArray(&array, &channelDesc, extent, cudaArrayLayered));

	cudaMemcpy3DParms parms;
	memset(&parms, 0, sizeof(parms));
	parms.extent = make_cudaExtent(n, m, o);
	parms.srcPos = make_cudaPos(0, 0, 0);
	parms.srcPtr = make_cudaPitchedPtr(arr, n*sizeof(arr[0][0][0]), n, m);
	parms.dstArray = array;
	parms.dstPos = make_cudaPos(0, 0, 0);
	parms.kind = cudaMemcpyHostToDevice;
	CUDA_SAFE_CALL(cudaMemcpy3D(&parms));
		
	// create resource description
	cudaResourceDesc resDesc;
	memset(&resDesc, 0, sizeof(resDesc));
	resDesc.resType = cudaResourceTypeArray;
	resDesc.res.array.array = array;

	// create texture description
	cudaTextureDesc texDesc;
	memset(&texDesc, 0, sizeof(texDesc));
	texDesc.readMode = cudaReadModeElementType;
	texDesc.filterMode = cudaFilterModeLinear;
	texDesc.addressMode[0] = cudaAddressModeClamp;
	texDesc.addressMode[1] = cudaAddressModeClamp;
	texDesc.addressMode[2] = cudaAddressModeClamp;

	// create texture object
	cudaTextureObject_t tex = 0;
	CUDA_SAFE_CALL(cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL));


	kernel << <1, 1 >> >(tex, m, n, o);
	CHECK_LAUNCH_ERROR();
	CUDA_SAFE_CALL(cudaDeviceSynchronize());
	CUDA_SAFE_CALL(cudaDestroyTextureObject(tex));
	CUDA_SAFE_CALL(cudaFreeArray(array));
	return EXIT_SUCCESS;
}

@njuffa: We ‘validated’ it by setting the image pixel values on the CPU (using the ‘half_float’ class), transferring to the GPU (with a simple cudaMemcpy), manipulating the pixel values in a defined way with a GPU kernel (e.g., adding 3 to every intensity value), transferring back to the CPU, and reading out the values.

So for me, it seems that the CPU class and the GPU ‘near-native’ type are bit-identical, at least on the Windows platform. Makes sense, as the bit representation of ‘half’ is exactly defined by the IEEE standard (at least I think so).