Fastest way to implement small lookup table

What is the fastest way to implement a small lookup table (for example, fewer than 100 values) when there is no particular pattern in the distribution of the values?

I tried two approaches: storing the list in a constant array and doing a constant memory lookup, and writing an inline function containing a switch statement. The constant memory lookup version seems to be faster. (With OpenCL, however, the switch statement seemed to be faster in some cases.)

Is there a better way to do this, maybe with inline PTX, and without storing the values in memory separate from the executable binary? Maybe with an alternative version for the case where all threads in a warp access the same value?

There is no one true way. There is a design space that you can explore, and the profiler will help you find the best solution for your specific use case.

Are the values you need to look up floating-point values or integers? If the latter, what is the range of values that need to be stored? What is the locality of access?

Your design space looks roughly as follows:

(0) Replace table with computation if relatively few instructions are needed
(1) Use in-register lookup for a few values that are small in magnitude (e.g. sixteen 4-bit values)
(2) Use constant memory, especially if the accesses are relatively uniform across a warp (rule of thumb: on average, no more than three different entries accessed across a warp)
(3) Use shared memory
(4) Use global memory
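
As a rough illustration of option (1), here is a minimal sketch of an in-register lookup for sixteen 4-bit values. The table contents, the function name lookup4, and the HOST_DEVICE macro are all made up for this example; the macro is only there so the same code builds with nvcc or with a plain C++ compiler on the host.

```cpp
#include <cstdint>

// Expands to CUDA qualifiers under nvcc and to nothing under a plain C++
// compiler (HOST_DEVICE is a name invented for this sketch).
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// In-register lookup of sixteen 4-bit values packed into one 64-bit constant.
// Entry i occupies bits [4*i, 4*i+3], so entry 0 is the lowest hex digit.
// Hypothetical table, entries 0..15: 3,14,1,7,0,9,12,5,8,2,15,6,11,4,13,10
HOST_DEVICE inline unsigned lookup4(unsigned index)
{
    const uint64_t packed = 0xAD4B6F285C9071E3ULL;
    // Shift the wanted nibble down to bit 0, then mask off everything else.
    // The index is masked to 0..15 so an out-of-range value cannot shift too far.
    return static_cast<unsigned>((packed >> ((index & 15u) * 4u)) & 0xFu);
}
```

Because the table is an immediate constant, the lookup needs only a shift and a mask and no memory access at all, so it does not matter whether the threads of a warp use the same index or different ones. The approach stops being practical once the entries no longer fit into one or two registers.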