Fast Implementation of (Small-)Table Lookup

Hi,

In my use case, I need a (very) efficient implementation of table[index]. Notably,

  • table is relatively small. Think somewhere along the lines of 4–32 16-bit entries, although it could be up to 256 values if need be.
  • index is, for the most part, randomly distributed across all entries in the table.

How should I implement such a table lookup? I’ve done a fair bit of Googling / prototyping, and learned that:

  1. A register-backed table does not support dynamic indexing (it would be spilled to local memory).
  2. A shared-memory-backed table could work. However, it would run into a lot of bank conflicts if the table spans more than 32 banks, though there are ways to get around the conflicts. Still, this implies a lot of memory accesses and might not be ideal for my case.

Is Option 2 (with the trick to reduce bank conflicts) the best I can get? Is there any possible way to perform such a lookup using just registers? This post suggested that __shfl_sync might offer an even better way if the table has <= 32 elements. However, I’m not very familiar with warp-level intrinsics, so I can’t judge the feasibility/performance of this choice.

Any help/pointer is appreciated — thanks in advance for your time!

If table access is (mostly) uniform across a warp, __constant__ memory seems like the appropriate choice. You could also try to put it in global memory and let the cache hierarchy work its magic: a small footprint and hotspot characteristics are the best case for caches (requiring “very efficient” access implies that the table is accessed frequently). I have not used __shfl_sync to implement a table, so I cannot speak to that variant.
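For concreteness, here is a minimal sketch of the __constant__ variant, assuming a 32-entry table of 16-bit values; d_lut and lookup_kernel are placeholder names, and the benefit hinges on the index being (mostly) uniform across a warp.

__constant__ unsigned short d_lut[32];   // filled from the host via cudaMemcpyToSymbol(d_lut, h_lut, sizeof(h_lut))

__global__ void lookup_kernel(const int *indices, unsigned short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = d_lut[indices[i] & 31];   // fast only when the index is (mostly) the same across a warp
}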

Thinking outside the box: Could the table lookup be replaced with on-the-fly recomputation? Arithmetic operations are cheap.

If this were my code I would simply prototype all the alternatives and profile them. With an hour of work you would then have hard data in hand to make an informed decision.

Thank you so much for the information! If you don’t mind, I’d like to ask a couple of follow-up questions.

cache: Unfortunately, the access pattern is non-uniform (and the values can’t be recomputed efficiently). In that case, how would the cache behavior differ if the table is backed by __constant__ vs. __shared__? This post suggested that shared memory is almost always preferred over global/constant memory; would you agree with this statement as well?

__shfl_sync: By any chance, are you aware of the speed/latency numbers for a (non-bank-conflicted) shared memory read vs. __shfl_sync? The latter is usually used to facilitate inter-lane communication, so its superior performance is usually stated relative to a shared memory read + write (not a read alone). In practice, I did notice a slight slowdown, but there are some extra factors involved, so I’m not sure about that.

Thanks again for your prompt response!

Why engage in thought experiments when it is easy enough to perform a set of actual experiments that tests all design variants in the actual given context? If the experimental findings are surprising, one can (and should) still dig in with the help of the profiler to understand why things work out the way they do. That helps with forming a mental model for future reference. In my experience expert speculation based on incomplete information is wrong 50% of the time.

__shared__ memory has no interaction with the caches.

The most important aspect of __constant__ cache efficiency has already been pointed out: uniform access across a warp.

Shared memory has a throughput of 32 bits per bank, per clock, per SM. Warp shuffle has a throughput of one instruction warp-wide (so 32 thread ops) per SM per clock, except on cc 7.5, where the throughput is half that.

Some applications that make heavier use of shared memory (e.g. for reductions) may benefit by moving some of the reduction/shared load to warp-shuffle.

Thanks both for the detailed responses!

@njuffa totally agree — I’m in the process of trying things out.

@Robert_Crovella sorry I’m a bit new to this space, and am not very familiar with the nuances between “32 bits per bank-clock-SM” vs “one warp-wide instruction per SM-clock”. Are you suggesting that one shared memory read is more or less similar to one warp-shuffle instruction?

Not speaking for Robert, but that looks like the case. Shared memory (conflict-free) will have a higher latency: around 19 cycles (Volta, Turing) or 22 cycles (Ampere). I have not found the __shfl_sync latency, but it is not likely to be this high.

Yes. For either one, 32 threads per clock per SM can be serviced. Since warp shuffle moves data thread-to-thread, it might be more comparable to two shared memory ops (a load and a store) so careful thought/analysis may be needed.
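In case it helps, here is a minimal sketch of how a shuffle-based lookup could look, assuming a table of at most 32 16-bit entries held one-per-lane in a register and a full, non-divergent warp; warp_lookup and my_entry are placeholder names.

// each lane keeps one table entry in a register, e.g. loaded once at kernel
// start as my_entry = table[threadIdx.x & 31]
__device__ unsigned short warp_lookup(int my_entry, int index)
{
    // __shfl_sync returns the my_entry value held by lane 'index';
    // 'index' may differ per lane, so every lane can look up a different entry
    return (unsigned short)__shfl_sync(0xFFFFFFFF, my_entry, index);
}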

I personally would not start out with a focus on latency. Latency only matters if you are latency bound. On a GPU, an overarching objective is to avoid latency bound situations. Expose enough parallelism/work.

Understood, thanks again for all the help!

How about copying the table 32 times so that each thread within a warp (= lane) can access its own bank? Then you can guarantee no bank conflicts. If you store your 16-bit values as 32-bit values (4 bytes), have the maximum stated table size of 256, and store the table 32 times (once per bank, i.e. per lane), then your shared memory requirement is 4 * 256 * 32 bytes = 32 KB, which is available.

__shared__ int lut[256][32]; // access as lut[index][threadIdx.x & 31]

The shared memory approach is probably the most straightforward and fastest in your case.

(If each thread does more than one lookup, you could also sort the lookups according to bank number and/or groups of 64 entries (32 bits * 32 banks / 16 bits), and sort the results back afterwards. However, that can get very involved without using any indices and while keeping the threads non-divergent.)
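If it is useful, here is a minimal sketch of the replicated-table idea, assuming a 256-entry source table in global memory; fill_and_lookup and g_table are placeholder names.

__global__ void fill_and_lookup(const unsigned short *g_table, const int *indices,
                                unsigned short *out, int n)
{
    __shared__ int lut[256][32];                 // one copy of the table per bank/lane, 32 KB total
    for (int e = threadIdx.x; e < 256; e += blockDim.x)
        for (int b = 0; b < 32; ++b)
            lut[e][b] = g_table[e];              // replicate each entry into all 32 banks
    __syncthreads();

    int lane = threadIdx.x & 31;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (unsigned short)lut[indices[i]][lane];   // each lane reads only from its own bank
}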

With some tricks, the register-based approach would work, too. (However, a size of 256 may be a bit high for it.) C++ may need some convincing not to use local memory:

Either do

switch (index) {
case 0: result = table[0]; break;
case 1: result = table[1]; break;
...
}

When only using compile-time-known indices like above (or within unrolled loops), local arrays (here, table) can be stored in registers.
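To illustrate the unrolled-loop alternative, here is a minimal sketch, assuming a 16-entry table; reg_lookup_kernel and g_table are placeholder names. Whether the table actually stays in registers depends on the compiler, so checking the generated SASS is advisable.

__global__ void reg_lookup_kernel(const unsigned short *g_table, const int *indices,
                                  unsigned short *out, int n)
{
    unsigned short table[16];
    #pragma unroll
    for (int i = 0; i < 16; ++i)
        table[i] = g_table[i];              // only constant indices after unrolling: can stay in registers

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) {
        int index = indices[t] & 15;        // dynamic lookup index
        unsigned short result = 0;
        #pragma unroll
        for (int i = 0; i < 16; ++i)
            if (i == index)
                result = table[i];          // again only constant indices; compiles to selects/predicated moves
        out[t] = result;
    }
}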

Or combine lots of ?: (select) operators:

// example with 16 entries; with up to 256 entries, it starts to get unwieldy (you would have to copy lots of unneeded data)
// for each 4 bytes a separate select instruction is emitted
int4 table0; // lower 8 entries (two 16-bit values per int)
int4 table1; // upper 8 entries

bool bit3 = (index & 8) != 0, bit2 = (index & 4) != 0, bit1 = (index & 2) != 0, bit0 = (index & 1) != 0;

int4 step0 = bit3 ? table1 : table0;                                                      // narrow to 8 entries
int2 step1 = bit2 ? make_int2(step0.z, step0.w) : make_int2(step0.x, step0.y);            // narrow to 4 entries
short4 step1s = reinterpret_cast<short4 &>(step1); // std::bit_cast (or memcpy) avoids the strict-aliasing UB, but this works in practice
short2 step2 = bit1 ? make_short2(step1s.z, step1s.w) : make_short2(step1s.x, step1s.y);  // narrow to 2 entries
short result = bit0 ? step2.y : step2.x;                                                  // final entry

An advantage of the register-based implementations is that they do not need SM-wide resources like shared memory or the shuffle unit, so they could be faster at full occupancy.

A theoretical third option would be the tensor cores, multiplying with a matrix that is mostly 0 and has a 1 at the position of the index. But that would probably be slower in this case.

You could implement the shared memory approach for up to 32 values (or even 64 16-bit values, packed) without worry of bank conflicts, and without any duplicate tables. For the 256 entry case, with packing you can cut the memory requirement in half.

Thanks for the comments @Robert_Crovella @Curefab!

For the 256 entry case, with packing you can cut the memory requirement in half.

Could you elaborate on what you meant by “packing”? Is this referring to storing 2 of the 16-bit table entries as 1 32-bit entry?

If you store your 16-bit values as 32-bit values (4 bytes)

Maybe even more naively, why would I not want to do this, per the comment from @Curefab; i.e., why would I want to store 16-bit values as 32-bit values?

Yes. If the storage is as an array of 16-bit values, this “packing” will take place “automatically”.
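To illustrate, here is a minimal sketch of that automatic packing, assuming a 256-entry table copied from global memory; lookup_packed and g_table are placeholder names.

__global__ void lookup_packed(const unsigned short *g_table, const int *indices,
                              unsigned short *out, int n)
{
    __shared__ unsigned short lut[256];   // 512 bytes instead of 1 KB; two entries share each 32-bit bank word
    for (int e = threadIdx.x; e < 256; e += blockDim.x)
        lut[e] = g_table[e];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = lut[indices[i]];         // the hardware extracts the 16-bit half of the bank word
}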

Ah got it, thanks for the super fast reply :)!
