Price for lookup tables

Hello

My project involves a bunch of small GF(2^8) field multiplications and a lot of XORing between their results. Besides making it easier to implement, I’m wondering if using lookup tables would help efficiency.

Would it be possible to use some OpenGL calls to move these fairly big lookup tables (about nine 256-entry arrays of 8-bit values) into a “read only” section? Since each block would be using these tables, could it still be efficiently parallel?

I have a hard time judging when lookup tables are a good idea; any general advice? I didn’t notice much info in the manual.

Thanks

If each thread will be accessing the same table entry at the same time, then constant memory can be very effective. There is 8 kB of constant cache per multiprocessor, so your tables (9 × 256 bytes = 2.25 kB) would fit entirely in that cache.
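
Something like this, as a minimal sketch (the names gf_tables and gf_lookup are mine, not from your code):

```
// Nine 256-entry GF(2^8) tables in constant memory (illustrative names).
__constant__ unsigned char gf_tables[9][256];

__global__ void gf_lookup(const unsigned char *in, unsigned char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Fast only when all threads of a warp read the same entry at once;
        // per-thread indices like in[i] serialize the constant-cache reads.
        out[i] = gf_tables[0][in[i]];
    }
}
```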

If threads will access the tables divergently, then a texture will be better. Textures also get 8 kB of cache per multiprocessor, so all of your tables would eventually be fully cached here as well.
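
For the divergent case, a sketch using the texture reference API of this CUDA generation (gf_tex and the kernel name are placeholders):

```
// 1D texture reference bound to a table in linear device memory.
texture<unsigned char, 1, cudaReadModeElementType> gf_tex;

__global__ void gf_lookup_tex(const unsigned char *in, unsigned char *out,
                              int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(gf_tex, in[i]);  // cached; tolerates divergence
}

// Host side (error checking omitted):
//   unsigned char *d_table;
//   cudaMalloc((void **)&d_table, 256);
//   cudaMemcpy(d_table, h_table, 256, cudaMemcpyHostToDevice);
//   cudaBindTexture(0, gf_tex, d_table, 256);
```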

Another option for the divergent read case is to put the table in shared memory, but I would try the texture first if you need shared memory for something else, like read-write scratch space for the block.
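
The shared-memory variant would stage the table cooperatively at kernel start, roughly like this (again, names are illustrative):

```
__global__ void gf_lookup_shared(const unsigned char *table,
                                 const unsigned char *in,
                                 unsigned char *out, int n)
{
    __shared__ unsigned char s_table[256];

    // Each thread copies a slice of the 256-byte table into shared memory.
    for (int t = threadIdx.x; t < 256; t += blockDim.x)
        s_table[t] = table[t];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s_table[in[i]];  // handles divergent indices well
}
```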

Note that none of these options involve OpenGL calls. You can do them all directly through CUDA. To copy to constant memory, see the example at the top of pg. 34 in the CUDA 1.1 Programming Guide.
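
The copy itself is essentially one call to cudaMemcpyToSymbol; a sketch reusing the gf_tables declaration from above (error checking omitted):

```
// Host side: copy all nine host tables into the __constant__ symbol
// gf_tables declared in the earlier sketch.
void upload_tables(const unsigned char h_tables[9][256])
{
    cudaMemcpyToSymbol(gf_tables, h_tables, 9 * 256);
}
```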

Lookup tables are only a good idea if the values can’t be calculated on the fly. Otherwise, it’s usually better to do some extra math than to do memory accesses. GPUs like compute-heavy kernels.
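
For the GF(2^8) multiplies in the original question, the on-the-fly version is a shift-and-XOR loop. A sketch assuming the AES reduction polynomial 0x1B; your field polynomial may differ:

```
// GF(2^8) multiply without tables: 8 rounds of conditional XOR,
// shift, and modular reduction by the field polynomial.
__device__ unsigned char gf256_mul(unsigned char a, unsigned char b)
{
    unsigned char p = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & 1)
            p ^= a;                      // add a into the product
        unsigned char carry = a & 0x80;
        a <<= 1;                         // multiply a by x
        if (carry)
            a ^= 0x1B;                   // reduce mod x^8+x^4+x^3+x+1
        b >>= 1;
    }
    return p;
}
```

That is a few dozen ALU operations per multiply, which is exactly the kind of math-vs-memory trade being discussed here.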

But where is the line?

At some point that stops being true, and I don’t have enough experience to know when I’m close to that point.

Thanks

If performance is really important, then coding it up both ways and comparing results is what I do. This is generally not difficult, and the differences are still surprising enough that I find it worthwhile.
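
A minimal sketch of that comparison using CUDA events (the kernel bodies and launch configuration are placeholders):

```
#include <cstdio>

// Placeholder kernels standing in for the two variants under test.
__global__ void kernel_with_table(unsigned char *out) { /* ... */ }
__global__ void kernel_on_the_fly(unsigned char *out) { /* ... */ }

int main()
{
    unsigned char *d_out;
    cudaMalloc((void **)&d_out, 64 * 256);

    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    kernel_with_table<<<64, 256>>>(d_out);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("table version:   %f ms\n", ms);

    cudaEventRecord(start, 0);
    kernel_on_the_fly<<<64, 256>>>(d_out);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("compute version: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```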

There is no substitute for experimentation, as any “line” is bound to be implementation dependent (as has been said before).

I will go out on a limb and say ~50 math operations is the line, though. This is under the assumption that this is a “worst case” usage of a lookup table value (i.e., it is read every time in an inner loop run hundreds of times in each thread on the GPU). If it is a value read once before the inner loop, you can get away with more.

Indeed, it depends on various things, like the memory-heaviness of the rest of your kernel.
If the lookup tables are small enough, they fit into the texture cache. That is very fast (but might still be slower than math); see also http://forums.nvidia.com/index.php?showtopic=59048

The main problem I encountered when weighing lookup tables against on-the-fly calculation is register usage, which in some cases directly influences occupancy; e.g., occupancy goes down when calculating on the fly.

It’s always a trade-off, and as you guys said, there’s no substitute for experimentation.