I haven’t found a definitive answer for this: I know GPUs have been optimized with many FPUs and transcendental functions to make floating point very fast, but is it faster than using fixed point? For example, if I only needed 5 bits of precision for fairly small numbers, that can easily be held in a 32-bit register. Is it faster for the GPU to do computations from a LUT of integers, or to do FP computations and call transcendentals?

I have not tried doing much fixed point arithmetic on the GPU, but I have used fixed point numbers to reduce the amount of data that needs to move from device memory to the GPU. Once the data was loaded into GPU registers, I converted it back to floating point to do arithmetic. This was a big improvement for memory bandwidth-limited applications.
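To illustrate the pattern (host-side C++ rather than device code, and the Q0.16 format below is just an assumption for illustration — pick whatever fixed-point format matches your data's range): store 16-bit integers in memory, then convert back to float once the value is in a register and do the arithmetic in floating point.

```cpp
#include <cstdint>

// Hypothetical Q0.16 storage format for values in [0, 1):
// half the memory traffic of float, converted back to float in registers.
const float kScale = 1.0f / 65536.0f;

uint16_t to_fixed(float x)    { return (uint16_t)(x * 65536.0f + 0.5f); }
float    to_float(uint16_t q) { return q * kScale; }

// What a kernel body would do per element: load 16-bit data,
// convert in registers, compute in single precision.
void saxpy_fixed(const uint16_t* x_q, float a, float* y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] += a * (x_q[i] * kScale);
}
```

In an actual kernel the loads from `x_q` are where the bandwidth saving comes from; the conversion itself is a single integer-to-float instruction plus a multiply.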

For your case, where the lookup table for a transcendental function would be very small, I think it could be faster than doing the floating-point arithmetic directly. Keep in mind that CUDA multiprocessors have special hardware for computing transcendental functions with reduced precision, which can outperform a larger lookup table.

I think it is quite unlikely that a fixed-point implementation would be faster. The code will need some form of addressing math to access the LUT, which means additional instructions. For what it is worth, the MUFU instructions for the transcendental functions provided by the GPU hardware already use table lookup (plus quadratic interpolation in fixed-point arithmetic).
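For intuition, here is what "table lookup plus quadratic interpolation" looks like in principle. This is purely illustrative — the actual MUFU table contents are proprietary and the hardware interpolates in fixed point, whereas this sketch uses float and an assumed 33-node table over [0, pi/2]:

```cpp
#include <cmath>

const float PI_2 = 1.57079632679f;
const int   N    = 32;              // 33 nodes over [0, pi/2]

struct Entry { float f, d1, d2; };  // value, 1st and 2nd derivative at node
Entry table[N + 1];

void build_table() {
    for (int i = 0; i <= N; ++i) {
        float x = PI_2 * i / N;
        table[i] = { sinf(x), cosf(x), -sinf(x) };
    }
}

float sin_approx(float x) {         // x in [0, pi/2]
    float t  = x / PI_2 * N;        // fractional table coordinate
    int   i  = (int)t;
    float dx = (t - i) * PI_2 / N;  // offset from the table node, in radians
    // second-order Taylor step from the nearest node below x
    return table[i].f + dx * (table[i].d1 + 0.5f * dx * table[i].d2);
}
```

With a step of pi/64, the cubic truncation error is on the order of 2e-5 — which is roughly the accuracy regime the reduced-precision hardware units operate in.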

I would suggest simply using single-precision arithmetic and compiling the code with -use_fast_math, since your use case doesn’t require last-bit accuracy.

For the 5-bit case mentioned, I imagined using the fixed-point number directly as an unsigned offset into an array of precomputed function values. However, I agree it is much better to use the fast_math versions of the floating-point functions, especially since many calculations will have intermediate values that should be computed with higher precision than the fixed-point inputs.
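Concretely, the direct-indexing idea looks like this (the choice of exp and the mapping of codes 0..31 to x = code/32 are just assumptions for the example):

```cpp
#include <cmath>

// A 5-bit fixed-point input indexes directly into a 32-entry table of
// precomputed function values -- no interpolation, no address arithmetic
// beyond a mask.
float exp_lut[32];

void build_exp_lut() {
    for (int code = 0; code < 32; ++code)
        exp_lut[code] = expf(code / 32.0f);  // assumed mapping: x = code/32
}

float exp5(unsigned code) {       // code is the 5-bit fixed-point value
    return exp_lut[code & 31u];   // mask keeps the index in range
}
```

On a GPU such a table would live in shared or constant memory; the point of the replies above is that even this minimal addressing costs instructions that the MUFU path does not.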

Depending on the use case, the following two approaches may be worthy of consideration:

(1) Store data in textures, and use texture interpolation to replace explicit computation. As best I know, the hardware interpolation uses 9-bit fixed-point arithmetic, so this is a low-accuracy approach. I used this previously with single-precision floating-point data, but saw little benefit from it on Kepler-based GPUs. Your mileage may vary.
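A host-side model of that accuracy limit: if, per the CUDA programming guide, the interpolation fraction is held in a 9-bit fixed-point format with 8 fractional bits, lerp weights snap to multiples of 1/256 no matter how precise the stored texels are. (The rounding mode below is an assumption; the quantization step is the point.)

```cpp
#include <cmath>

float lerp_exact(float a, float b, float w) {
    return a + w * (b - a);
}

// Model of texture filtering: the blend weight is quantized to
// multiples of 1/256 before the interpolation is performed.
float lerp_tex9(float a, float b, float w) {
    float wq = floorf(w * 256.0f + 0.5f) / 256.0f;  // round to 1/256 grid
    return a + wq * (b - a);
}
```

So between two texels that differ by d, the interpolated value can be off by up to d/512 — fine for many graphics workloads, but worth checking against your precision budget.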

(2) Use 16-bit floating-point data. Note that in CUDA, “half float” is purely a storage format, not a computational format. However, it can provide a significant reduction in memory bandwidth requirements. Some details are covered in the following forum thread: https://devtalk.nvidia.com/default/topic/547080/-half-datatype-ieee-754-conformance/