Static rounding modes

Hello,

I’m currently working on a small, simple CUDA interval library (for a bachelor thesis) and need some help with the rounding modes.

The Programming Guide states:

and later on in the appendix:

My question is, how can I “statically” set the rounding mode to “round-towards-zero”?

I know there are C intrinsics for all four rounding modes, but being intrinsics/functions, they are very slow.

And a little bit off-topic:

I tried to multiply large float numbers with __fmul_rz,

e.g. __fmul_rz(2x10^32, 2x10^32)

and I expected the result to be +infinity. However, the result is the number just “below infinity”,

i.e. the largest finite float: 3.4028235x10^38

It seems as if __fmul_rz “rounds down” infinity. I’m not sure whether this is my fault, because I am using the 3.0 SDK Debug Emulator (which is deprecated).

Thanks for any help!

The statement cited above applies to single-precision addition and multiplication on sm_1x hardware. On sm_2x platforms, single-precision addition, multiplication, and fused multiply-add are supported directly in hardware with all four IEEE rounding modes. To provide a uniform interface at the CUDA C level, __fadd_ru(), __fadd_rd(), __fmul_ru(), and __fmul_rd() are emulated in software on sm_1x platforms and are therefore slow.
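To illustrate how the directed-rounding intrinsics are typically used, here is a minimal sketch of interval addition. The `Interval` struct, the `add` function, and the `demo` kernel are hypothetical names made up for this example; only `__fadd_rd()` and `__fadd_ru()` are CUDA intrinsics. It assumes you compile for a target where these are acceptable performance-wise (e.g. sm_2x).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical minimal interval type, for illustration only.
struct Interval {
    float lo;
    float hi;
};

// Interval addition: the lower bound is rounded toward -infinity and the
// upper bound toward +infinity, so the exact sum stays enclosed.
__device__ Interval add(Interval a, Interval b)
{
    Interval r;
    r.lo = __fadd_rd(a.lo, b.lo);
    r.hi = __fadd_ru(a.hi, b.hi);
    return r;
}

__global__ void demo(Interval *out)
{
    Interval a = {0.1f, 0.1f};
    Interval b = {0.2f, 0.2f};
    *out = add(a, b);
}

int main()
{
    Interval *d_out = 0;
    Interval h_out;
    cudaMalloc(&d_out, sizeof(Interval));
    demo<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(Interval), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    // Print both bounds with enough digits to tell adjacent floats apart.
    printf("[%.9g, %.9g]\n", h_out.lo, h_out.hi);
    return 0;
}
```

The rounding mode is part of the intrinsic's name, so each operation selects its rounding individually and there is no global mode to switch.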

The following paper explains how to create an interval library for sm_1x platforms using only the round-to-zero and round-to-nearest rounding modes supported by hardware:

http://hal.archives-ouvertes.fr/hal-00263670/

Sylvain Collange, Jorge Flórez, David Defour

A GPU interval library based on Boost.Interval

8th Conference on Real Numbers and Computers, Santiago de Compostela : Spain (2008)
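
As a rough illustration of the general idea (not necessarily the algorithm used in the paper), an enclosure of a sum can be built from the round-toward-zero result alone: that result is never larger in magnitude than the exact sum, so stepping it one float away from zero gives the opposite bound. The names `Interval`, `step_away_from_zero`, and `enclose_sum` are hypothetical. The sketch always widens by one step, even when the sum is exact, and it conservatively widens a zero result to [-FLT_MIN, FLT_MIN] on the assumption that denormal results may be flushed to zero.

```cuda
#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical minimal interval type, for illustration only.
struct Interval {
    float lo;
    float hi;
};

// Move a finite, nonzero float one representable step away from zero by
// incrementing its bit pattern (the magnitude bits grow for either sign).
__device__ float step_away_from_zero(float x)
{
    return __int_as_float(__float_as_int(x) + 1);
}

// Enclose a + b using only the round-toward-zero addition.
__device__ Interval enclose_sum(float a, float b)
{
    float r = __fadd_rz(a, b);          // |r| <= |a + b|
    Interval out = {r, r};
    if (isinf(r) || isnan(r)) {
        return out;                     // no widening for infinities or NaN
    }
    if (r > 0.0f) {
        out.hi = step_away_from_zero(r);    // exact sum lies in [r, next above r]
    } else if (r < 0.0f) {
        out.lo = step_away_from_zero(r);    // exact sum lies in [next below r, r]
    } else {
        out.lo = -FLT_MIN;              // assumption: a tiny result may have
        out.hi = FLT_MIN;               // been flushed to zero
    }
    return out;
}

__global__ void demo(Interval *out)
{
    *out = enclose_sum(1.0f, 1.0e-10f);
}

int main()
{
    Interval *d_out = 0;
    Interval h_out;
    cudaMalloc(&d_out, sizeof(Interval));
    demo<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(Interval), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    printf("[%.9g, %.9g]\n", h_out.lo, h_out.hi);
    return 0;
}
```

The price of this approach is intervals that are up to one step wider than with true directed rounding; in exchange, it avoids the software-emulated intrinsics.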

Thanks Norbert for the citation. :)

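Concerning the second question:

This is the expected result according to the IEEE 754 standard. With round-towards-zero, a positive overflow yields the largest finite number rather than +infinity. Note that this is also consistent with interval arithmetic.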

If your multiplication example is computed using interval arithmetic (single-precision), the resulting interval will be [3.4028235x10^38, +infinity]. It contains the exact result 4x10^64.

Returning [+infinity, +infinity] in this case would break the containment property (assuming we manage to properly define what [+inf,+inf] means…)
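
A small test kernel along these lines can be used to check the overflow behaviour on real hardware. The kernel name `overflow_demo` and the output layout are made up for this sketch; the comments state what IEEE 754 prescribes for each rounding mode, not output captured from a particular device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Multiply two large floats under each of the four rounding modes.
__global__ void overflow_demo(float *out)
{
    float a = 2e32f;
    float b = 2e32f;                // exact product 4x10^64 overflows float
    out[0] = __fmul_rz(a, b);       // IEEE 754: largest finite float
    out[1] = __fmul_rd(a, b);       // IEEE 754: largest finite float
    out[2] = __fmul_ru(a, b);       // IEEE 754: +infinity
    out[3] = __fmul_rn(a, b);       // IEEE 754: +infinity
}

int main()
{
    float *d_out = 0;
    float h_out[4];
    cudaMalloc(&d_out, sizeof(h_out));
    overflow_demo<<<1, 1>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    printf("rz: %.9g\nrd: %.9g\nru: %.9g\nrn: %.9g\n",
           h_out[0], h_out[1], h_out[2], h_out[3]);
    return 0;
}
```

Taking the round-down result as the lower bound and the round-up result as the upper bound gives the enclosure described above.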

If you are only targeting sm_20 platforms, you can also have a look at the Interval sample in the CUDA SDK 3.2.

Thank you both very much for the replies! :)
It all makes sense now.