Load and store half-floats from device memory. How to shift those bits correctly...

Hi all,
I’m running out of GPU RAM for my algorithms and was thinking that half-floats offer more than enough precision for everything I need.

I am using the driver API, so I know I can read half-float textures, but I need a way to write and also read half-floats to/from device linear memory. I took a quick look at the PTX manual and saw lots of intrinsics for the half-float conversions.

Why are there no CUDA functions for these?

I guess the conversion can also be done “manually” through some smart bit shifting and the like.
But I’m not sure how.

Can someone help?


For manual conversions I would read the format description, e.g. over here: http://en.wikipedia.org/wiki/Half_precision.

Since the half-float description above is short and the authors assumed you know what floats look like, I would suggest checking out the description of the full single-precision format
to understand how full floats are represented.

Since the device uses similar formats to the host, I would first try to implement successful conversions on the host before playing with CUDA.
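To make the host-first approach concrete, here’s a minimal sketch in C of both directions (the names `float_to_half` / `half_to_float` are mine, not from any CUDA header). It handles normal values, passes Inf/NaN through, flushes denormals to zero, and truncates the mantissa rather than rounding to nearest — so it’s a starting point, not a bit-exact reference:

```c
#include <stdint.h>
#include <string.h>

/* float (1s/8e/23m, bias 127) -> half (1s/5e/10m, bias 15) */
static uint16_t float_to_half(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                   /* type-pun safely */

    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127 + 15; /* rebias */
    uint32_t mant = bits & 0x007FFFFFu;

    if (((bits >> 23) & 0xFF) == 0xFF)                /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x0200u : 0));
    if (exp <= 0)  return sign;                       /* denormal/underflow -> signed 0 */
    if (exp >= 31) return (uint16_t)(sign | 0x7C00u); /* overflow -> Inf */
    return (uint16_t)(sign | (uint16_t)(exp << 10) | (uint16_t)(mant >> 13));
}

static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x03FFu;
    uint32_t bits;

    if (exp == 0)
        bits = sign;                                  /* zero (denormals were flushed) */
    else if (exp == 31)
        bits = sign | 0x7F800000u | (mant << 13);     /* Inf/NaN */
    else
        bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, `float_to_half(1.0f)` gives `0x3C00` (sign 0, exponent 15, mantissa 0), which matches the encoding table on the Wikipedia page.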

From what I know, CUDA differs in:

  • no support for denormalised values (very small values close to 0)
  • no or different support for invalid values (NaNs)
  • not sure how ± infinity is handled.

For fast float <-> half float conversions on the CPU, check out this article



No promises, but half-float to float conversion intrinsics are planned for CUDA 2.3

I just learned more than I ever wanted to know about floating-point representation…

Yes, looking at the complexity of handling the denormals etc. correctly, I realize it’s not just a matter of shifting bits.
Knowing it’s all in there in PTX already doesn’t really motivate me to tackle this manually.

So I’ll cross my fingers this comes into CUDA 2.3 - thanks for promising it Simon :)

Thank you all for your kind help.

BTW: Just in case someone wondered why I’m running out of RAM… I’m working on a Mac with the GT8800 that only has 512MB. The Quadro with 1.5GB is way too expensive for me. If I were a Windows user I would certainly have already solved that…

CUDA does not support denormals anyway, and if you code your program carefully you can also avoid infinities and NaNs, leaving only the bit shifting for your code.
So if you don’t want to wait for CUDA 2.3, I would try the simplest shifting algorithm anyway and keep my fingers crossed :)
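For what it’s worth, that simplest shifting algorithm could look like this in C (my own naming; it assumes normal, finite inputs within half range — zero gets a special branch, everything else is pure shifts, with the mantissa truncated rather than rounded):

```c
#include <stdint.h>
#include <string.h>

static uint16_t float_to_half_fast(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7F800000u) == 0)                 /* ±0 (and denormals, flushed) */
        return (uint16_t)((bits >> 16) & 0x8000u);
    return (uint16_t)(((bits >> 16) & 0x8000u)            /* sign */
                    | ((((bits >> 23) & 0xFFu) - 112u) << 10) /* rebias 127 -> 15 */
                    | ((bits >> 13) & 0x03FFu));          /* top 10 mantissa bits */
}

static float half_to_float_fast(uint16_t h) {
    uint32_t bits;
    if ((h & 0x7C00u) == 0)                        /* ±0 */
        bits = (uint32_t)(h & 0x8000u) << 16;
    else
        bits = ((uint32_t)(h & 0x8000u) << 16)            /* sign */
             | ((((uint32_t)(h >> 10) & 0x1Fu) + 112u) << 23) /* rebias 15 -> 127 */
             | (((uint32_t)h & 0x03FFu) << 13);           /* mantissa */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The same shifts should translate directly into device code; just be aware that anything outside the normal range (Inf, NaN, denormals, overflow) will silently produce garbage with this version.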

Alternatively, maybe there are other ways of decreasing your memory requirements?

I was so lost in those floating-point docs that I overlooked that info from your first post.

Leaving only the bit shifts seems worth trying out.

Many thanks for pointing that out to me again!

(But I am still crossing fingers…)

Actually, the double-precision instructions do support subnormal inputs and results, while single-precision instructions flush subnormal inputs and results to zero.

(PTX ISA 1.4 manual p.55)


Ah… I never worked with doubles, sorry.