What about half-float?

I am considering using 16-bit floating point for some of the floating-point data in my kernel, to fit more of my data in shared memory. But glancing over the CUDA manual, it seems this data type isn’t supported in CUDA unless you use the limited array type? Is there any chance of general 16-bit floating point types being added in the near future?

Trying to emulate it with code seems rather awkward. And using 16-bit fixed-point might not be accurate enough.

The driver API supports textures that contain 16-bit floats through the CU_AD_FORMAT_HALF array format. The values are automatically promoted to 32-bit during the texture read.

16-bit float textures are planned for a future release of CUDART.

Other support for 16-bit floats is also planned for a future release, such as enabling kernels to convert between 16- and 32-bit floats (to read/write float16 while processing in float32).

“The values are automatically promoted to 32-bit during the texture read.”

How do I declare the 16-bit float texture reference?

The part you did not quote says: “The driver API supports textures that contain 16-bit floats through the CU_AD_FORMAT_HALF array format.”

The simpleTextureDrv SDK example shows how to use textures with the driver API.

Yes, the driver API does support textures that contain 16-bit floats, like this:

cuTexRefSetFormat(cu_texref, CU_AD_FORMAT_HALF, 1)

But in the kernel file, you have to declare a texture reference for the 16-bit floats, some format like this:

texture<float_half, 2, cudaReadModeElementType> tex;

The problem is that there is no such data type as “float_half”. What should I do?

Forum software went a little nuts with the quintuple posting?

The value you pass into the template parameter is the value that tex2D will return, so it should be float. The hardware knows about the internal half float format and will automatically convert to float when read, as has been pointed out twice already.

Thank you very much!

You did me a big favor!

I’m disappointed to see that Cuda 2.2 still doesn’t support this.

Specifically I want to FFT-process a 16bit float RGBA D3D rendertarget texture and then hand it back to D3D. Is there a projected ETA for that?

FYI, there is a nice branchless FP32<->FP16 implementation from Mike Acton available here. It should not be too hard to translate into CUDA.

Thanks, but I suspect a 256x256 lookup table will be more efficient. Still, it’s quite a waste when the hardware can already do this.
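For what it’s worth, here is a sketch of building such a table on the host, assuming the standard IEEE 754 binary16 layout. Indexed by the raw 16-bit pattern it needs 2^16 entries (the 256x256 mentioned above); this is illustrative, not NVIDIA’s implementation:

```cpp
#include <cstdint>
#include <cstring>
#include <cmath>
#include <vector>

// Build a full half->float lookup table: one float per 16-bit pattern,
// 2^16 entries (256 KB), so each conversion becomes a plain array index.
std::vector<float> build_half_table()
{
    std::vector<float> table(1u << 16);
    for (uint32_t h = 0; h < (1u << 16); ++h) {
        uint32_t sign = (h & 0x8000u) << 16;            // sign bit into float position
        uint32_t em   = (h & 0x7fffu) << 13;            // exponent+mantissa shifted up
        uint32_t bits;
        float f;
        if (((h >> 10) & 0x1fu) == 0x1fu) {             // half infinity or NaN
            bits = sign | 0x7f800000u | ((h & 0x3ffu) << 13);
            std::memcpy(&f, &bits, sizeof f);
        } else {                                        // zero, subnormal, or normal
            bits = sign | em;
            std::memcpy(&f, &bits, sizeof f);
            f = std::ldexp(f, 112);                     // rebias exponent: 15 -> 127
        }
        table[h] = f;
    }
    return table;
}
```

Note that a 256 KB table can thrash the cache if lookups are scattered, which is exactly why the free promotion during a texture read is attractive.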

The difficulty with D3D interop is that D3D textures can’t be accessed as CUDA textures, only by addressing memory directly - so the kernel would need to be able to convert between half/full float on read/write, just as Simon suggested. When is this planned, or are there plans to integrate D3D textures more closely with CUDA?

I am very interested in this support for half-floats, and a half-float data type available for kernels. It doesn’t look like this feature made it into CUDA 2.3… will this happen soon?

How about using these intrinsics for conversion? The CUDA 2.3 beta thread mentioned these as a new feature.

extern __device__ unsigned short __float2half_rn(float);

extern __device__ float __half2float(unsigned short);

Christian

Thank you, these worked pretty well. I haven’t examined it in great detail yet, but I assume the 16-bit float that CUDA uses is the same as IEEE 754?

http://en.wikipedia.org/wiki/Half_precision

I have read the simpleTextureDrv example. Is there a way to declare the texture without using the driver API? For example:

texture<half, 2, cudaReadModeElementType> tex;

When declared as above, I see a compilation error:

1>C:/ProgramData/NVIDIA Corporation/CUDA Samples/v9.0/0_Simple/simpleTexture/simpleTexture.cu(82): error : no instance of overloaded function "tex2D" matches the argument list
1>              argument types are: (texture<half, 2, cudaReadModeElementType>, float, float)

Is there any other way to declare a texture<half,,> besides the driver API?

// Allocate array and copy image data
cudaChannelFormatDesc channelDesc = cudaCreateChannelDescHalf();
cudaArray *cuArray;
checkCudaErrors(cudaMallocArray(&cuArray,
                                &channelDesc,
                                width,
                                height));
checkCudaErrors(cudaMemcpyToArray(cuArray,
                                  0,
                                  0,
                                  hData,
                                  size,
                                  cudaMemcpyHostToDevice));

// Set texture parameters
texture<half, 2, cudaReadModeElementType> tex;
tex.addressMode[0] = cudaAddressModeWrap;
tex.addressMode[1] = cudaAddressModeWrap;
tex.filterMode = cudaFilterModeLinear;
tex.normalized = true;    // access with normalized texture coordinates

// Bind the array to the texture
checkCudaErrors(cudaBindTextureToArray(tex, cuArray, channelDesc));

may be of interest:

https://devtalk.nvidia.com/default/topic/547080/cuda-programming-and-performance/-half-datatype-ieee-754-conformance/post/3831088/#3831088

Here is a minimal app using the CUDA runtime that demonstrates access to an FP16 texture. I compiled it with CUDA 8.0.60 on Win64 and ran it on an sm_50 device.

#include <stdlib.h>
#include <stdio.h>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                            \
 do {                                                                \
    cudaError_t err = call;                                             \
    if (cudaSuccess != err) {                                           \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",  \
                 __FILE__, __LINE__, cudaGetErrorString(err) );         \
        exit(EXIT_FAILURE);                                             \
        }                                                               \
 } while (0)
 
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                            \
    do {                                                                \
        /* Check synchronous errors, i.e. pre-launch */                 \
        cudaError_t err = cudaGetLastError();                           \
        if (cudaSuccess != err) {                                       \
            fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n", \
                     __FILE__, __LINE__, cudaGetErrorString(err) );     \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
        /* Check asynchronous errors, i.e. kernel failed (ULF) */       \
        err = cudaThreadSynchronize();                                  \
        if (cudaSuccess != err) {                                       \
            fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n", \
                     __FILE__, __LINE__, cudaGetErrorString( err) );    \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

texture<float, 2> tex;

__global__ void kernel (int m, int n) 
{
    float val;
    for (int row = 0; row < m; row++) {
        for (int col = 0; col < n; col++) {
            val = tex2D (tex, col + 0.5f, row + 0.5f);
            printf ("% 15.8e   ", val);
        }
        printf ("\n");
    }
}

int main (void)
{
    int m = 4; // height = #rows
    int n = 3; // width  = #columns
    size_t pitch, tex_ofs;
    unsigned short arr[4][3]= {{0x0000,0x0001,0x0002},  // zero, denormals
                               {0x3c00,0x3c01,0x3c02},  // 1.0 + eps
                               {0x4000,0x4001,0x4002},  // 2.0 + eps
                               {0x7c00,0x7c01,0x7c02}}; // infinity, NaNs
    unsigned short *arr_d = 0;
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDescHalf(); 
    CUDA_SAFE_CALL(cudaMallocPitch((void**)&arr_d,&pitch,n*sizeof(*arr_d),m));
    CUDA_SAFE_CALL(cudaMemcpy2D(arr_d, pitch, arr, n*sizeof(arr[0][0]),
                                n*sizeof(arr[0][0]),m,cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL (cudaBindTexture2D (&tex_ofs, &tex, arr_d, &channelDesc,
                                       n, m, pitch));
    if (tex_ofs !=0) {
        printf ("tex_ofs = %zu\n", tex_ofs);
        return EXIT_FAILURE;
    }
    kernel<<<1,1>>>(m, n);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    CUDA_SAFE_CALL (cudaFree (arr_d));
    return EXIT_SUCCESS;
}

The output should look something like this:

 0.00000000e+00    5.96046448e-08    1.19209290e-07
 1.00000000e+00    1.00097656e+00    1.00195313e+00
 2.00000000e+00    2.00195313e+00    2.00390625e+00
            inf               nan               nan

CUDA 9 may have better support for FP16 textures; I haven’t tried it yet.

@txbob and @njuffa thanks!

That solution worked for me, and I got a performance gain, but not by much, because each thread calls tex3D 32 times. (Yes, each thread!) I need to reduce the number of texture reads per thread.

tex* functions are typically very cheap in terms of instructions executed. What you may be up against is limited memory bandwidth. Use the CUDA profiler to pinpoint the bottlenecks in your code. If global memory bandwidth is confirmed as the limiting factor, consider

(1) Use of shared memory as a software-controlled cache

(2) On-the-fly computation as a substitute for data lookup

I performed an instruction-level profile; tex3D lights up.

Here are visual profiler images
[https://www.dropbox.com/s/elgvsx2833lpt43/UtilizationLevel.bmp?dl=0]
[https://www.dropbox.com/s/2ccjgq74kz2uvr0/kernelLevel.bmp?dl=0]

My code (in case interested):

__global__ void kernel_CalValueAndGrade(
	const float *pCoeff,//coeff for all volumes, COEFF_LENGTH_PERASCAN coefficients per Ascan
	const float *pVol1,//the volume pair
	const float *pVol2,//the volume pair
	const int height, const int width, const int depth,
	float *pOutVals, //the gradient buffer
	float *pOutRes, //the residual buffer
	float sim_weight // a const weight
	)
{
	int z_idx = threadIdx.x; //block size is 1*1*depth/2
	int nOffset = blockIdx.z * width + blockDim.y * blockIdx.y + threadIdx.y; // which A line current block is dealing with

	if (nOffset * HALF_DEPTH + z_idx >= height*width*HALF_DEPTH) return;
	
	extern __shared__ float smem[];
	
	////// ******* iVol 0, Fast X; iVol 1, Fast Y ************* /////
	#pragma unroll
	for(int iVol = 0; iVol < 2; iVol++)
	{	
		//////////// Load coeff //////////////
		// NOTE: the code that stages the coefficients into smem is elided in
		// this excerpt; this alias is an assumption about where coeff points.
		const float *coeff = smem;
		//////////// Interpolation along X and Y direction //////////////
		//now we do the 16 lines calculation //
		// 50% of the kernel time is here
		float3 val1 = make_float3(0.0f, 0.0f, 0.0f); // for position 1
		float3 val2 = make_float3(0.0f, 0.0f, 0.0f); // for position 2
		
		short ix = coeff[48];
		short iy = coeff[49];

		#pragma unroll
		for(short xy_idx=0;xy_idx<16;xy_idx++)
		{	
			short x = xy_idx & 3; // xy_idx mod 4
			short y = xy_idx >> 2; // xy_idx / 4

			short xcoord = ix+x-1;
			short ycoord = iy+y-1;

			float t1, t2;

			// 30% runtime here
			if(iVol==0) // Fast X
			{
				// each thread handles these two positions
				t1 = tex3D(volX, z_idx,				 xcoord, ycoord);
				t2 = tex3D(volX, z_idx + HALF_DEPTH, xcoord, ycoord);
			}
			else // Fast Y
			{
				// each thread handles these two positions
				t1 = tex3D(volY, z_idx,				 xcoord, ycoord);
				t2 = tex3D(volY, z_idx + HALF_DEPTH, xcoord, ycoord);
			}
			
			float c_xy = coeff[xy_idx];
			val1.x += c_xy * t1; //  x * y
			val2.x += c_xy * t2;

			float c_gxy = coeff[xy_idx + 16];
			val1.y += c_gxy * t1; // gx * y
			val2.y += c_gxy * t2;				

			float c_xgy = coeff[xy_idx + 32];
			val1.z += c_xgy * t1; // x * gy
			val2.z += c_xgy * t2;	
		}
	}

	return;
}