Using lookup table in constant memory

tajiknomi · December 28, 2018, 11:22am

I have an array which contains 48 elements with each element of size 4 bytes. The array is read-only for both the host and the device.

To optimize this, I thought I should declare the array as a global constant, but according to the CUDA documentation:

“the constant cache is best when threads in the same warp accesses only a few distinct locations. If all threads of a warp access the same location, then constant memory can be as fast as a register access”

But in my case every thread in a warp will access a random element of the array, so cache misses will most probably occur.

The same array is used by both the host function and the device kernel.

How can I declare this read-only array so that a single definition is used by both host and device?

Currently I have two definitions of the same lookup table: one for the device in, say, kernel.cu, and the other for the host in host.c.

Is there any other approach besides constant memory that would improve performance on both host and device?

tera · December 28, 2018, 11:48am

Constant memory indeed is a poor choice for the reason you cited. Shared memory would be a much better location.
A lookup table of up to 32 elements can even be placed in registers and accessed via __shfl_sync() (using one register of each thread and keeping separate tables for each warp).
Larger lookup tables that do not fit into shared memory can be placed in a texture, although nowadays there is little difference vs just keeping them in global memory.

In order to avoid duplication you can cudaMemcpy() the table from the host.
If the table is listed literally in the source code, you could also use a #define to avoid duplication.
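As a rough sketch of the #define idea, a single header could carry the table and be included from both the host source and the .cu file. Everything named here (lut.h, LUT_SIZE, LUT_VALUES, h_lut, d_lut, hostLookup, lookupKernel) is made up for illustration, and the 8 values are placeholders rather than the poster's 48 entries. The part inside the __CUDACC__ guard is only seen by nvcc.

```cpp
// lut.h (hypothetical name): one literal definition of the table,
// reused for the host copy and the device copy.
#include <cassert>
#include <cstdint>

#define LUT_SIZE 8                                      // placeholder size
#define LUT_VALUES { 3u, 1u, 4u, 1u, 5u, 9u, 2u, 6u }  // placeholder values

// Host copy, usable from any translation unit that includes this header.
static const std::uint32_t h_lut[LUT_SIZE] = LUT_VALUES;

// Host-side lookup with a debug-time bounds check.
static inline std::uint32_t hostLookup(int i)
{
    assert(i >= 0 && i < LUT_SIZE);
    return h_lut[i];
}

#ifdef __CUDACC__
// Device copy in global memory, initialized from the same macro.
// Each .cu file that includes this header gets its own copy.
static __device__ std::uint32_t d_lut[LUT_SIZE] = LUT_VALUES;

// Kernel sketch: stage the table in shared memory once per block,
// then let every thread do its (possibly random) lookup there.
__global__ void lookupKernel(const int* idx, std::uint32_t* out, int n)
{
    __shared__ std::uint32_t s_lut[LUT_SIZE];
    for (int i = threadIdx.x; i < LUT_SIZE; i += blockDim.x) {
        s_lut[i] = d_lut[i];
    }
    __syncthreads();

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) {
        out[t] = s_lut[idx[t]];
    }
}
#endif
```

With the cudaMemcpy() route, the device copy would instead be an ordinary device buffer (call it d_buf, again a made-up name) filled at startup with cudaMemcpy(d_buf, h_lut, sizeof(h_lut), cudaMemcpyHostToDevice), so only h_lut is listed in the source.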

njuffa · December 28, 2018, 4:59pm

When there is non-uniform access to the constant cache across a warp, the problem is not cache misses, but serialization. The constant cache can serve one chunk of data per cycle, but has a broadcast feature that can supply that data to all threads in a warp in parallel. If multiple different addresses are presented across the warp, data for these will be served in consecutive cycles until all requests are satisfied (serialization).
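A small host-side model can make the serialization cost concrete. It is only a sketch under the simplified assumption stated above (one distinct address served per pass, broadcast to every thread that asked for it); the name constantCachePasses and the warp size constant are made up here, and nothing in it runs on the GPU.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kWarpSize = 32;  // threads per warp

// Modeled number of serialized constant-cache passes for one warp-wide
// read: one pass per distinct table index requested by the warp.
inline std::size_t constantCachePasses(const std::array<int, kWarpSize>& idx)
{
    std::set<int> distinct;
    for (int i : idx) {
        assert(i >= 0);  // table indices are expected to be non-negative
        distinct.insert(i);
    }
    return distinct.size();
}
```

In this model a warp whose threads all read the same index costs one pass, a warp that touches three different indices costs three, and a warp where every thread reads a different index costs 32.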

Empirically, for up to three different addresses presented across the warp, putting such an array into constant memory is often still the best choice; otherwise use the approach recommended by tera. Since it is relatively easy to misjudge the amount of intra-warp address divergence (been there, done that, got the t-shirt), I would suggest prototyping both solutions and running a quick experiment.

Yet another alternative might be to replace this small table by computation. Sometimes standards will represent a functional relationship as a table which can also be expressed as a simple function. For example, the beta table in H.264 in-loop deblocking (table 8-16 in my PDF copy) expresses a simple piece-wise linear relationship.
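To illustrate the idea, here is a sketch with an invented 48-entry table that happens to be piecewise linear. The values are made up for this example and are not the H.264 beta table; kTable, computeEntry and tableMatchesFunction are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kTableSize = 48;

// Invented table for illustration only: flat at 0 for the first 16
// entries, then rising by 1 every two entries.
static const std::uint32_t kTable[kTableSize] = {
     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
     2,  2,  3,  3,  4,  4,  5,  5,  6,  6,  7,  7,  8,  8,  9,  9,
    10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17
};

// The same relationship expressed as arithmetic instead of a lookup.
inline std::uint32_t computeEntry(int i)
{
    assert(i >= 0 && i < kTableSize);
    return (i < 16) ? 0u : static_cast<std::uint32_t>(2 + (i - 16) / 2);
}

// Exhaustive check that the function reproduces every table entry.
inline bool tableMatchesFunction()
{
    for (int i = 0; i < kTableSize; ++i) {
        if (computeEntry(i) != kTable[i]) {
            return false;
        }
    }
    return true;
}
```

In a CUDA build, a function like computeEntry could be marked __host__ __device__ so that host and device share one definition and no table needs to be stored at all. Whether that beats a shared memory lookup depends on how cheap the replacement arithmetic is.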

tajiknomi · January 10, 2019, 12:03pm

Both pieces of advice are helpful.

It would not be feasible to use registers for the lookup table because it is almost 256 bytes in size.
I have tried the shared memory approach for the lookup table and the results are acceptable. However, I am a little confused about why the number of shared load transactions per access is greater than the ideal value of 1 in my case. Each entry of my lookup table is 4 bytes in size, so there shouldn't be any alignment issues when reading at the entry level (4 bytes).

Bank conflicts shouldn't be an issue because the threads of a warp are just READING the values from shared memory (almost randomly), and if more than one thread within the warp tries to read the same bank, the value should just be broadcast (not serialized). Correct me if I'm missing something here.

Here is the attached screenshot of the shared memory access pattern from my profiler: SM.png (Dropbox link)

Robert_Crovella · January 10, 2019, 12:16pm

More than one transaction per request will occur if the values are not organized one per bank.

Bank conflicts can occur on reads as well as writes.

If two threads access the same location, then broadcast will occur. But if two threads access two separate values in the same bank, then bank conflicts will occur.
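The distinction can be checked with a small host-side model. It is a sketch that assumes 32 banks of 4-byte words with consecutive words in consecutive banks, and a table that starts at a bank-aligned address; sharedLoadTransactions and the constants are made-up names, and the code models the hardware rather than running on it.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kWarpSize = 32;  // threads per warp
constexpr std::size_t kNumBanks = 32;  // assumption: 32 banks, 4 bytes wide

// Modeled number of transactions for one warp-wide load of 4-byte words.
// Threads reading the same word share a broadcast; distinct words that
// fall into the same bank are served one after another.
inline std::size_t sharedLoadTransactions(
    const std::array<std::size_t, kWarpSize>& wordIdx,
    std::size_t tableWords)
{
    std::array<std::set<std::size_t>, kNumBanks> wordsPerBank;
    for (std::size_t w : wordIdx) {
        assert(w < tableWords);  // index must lie inside the table
        wordsPerBank[w % kNumBanks].insert(w);
    }
    std::size_t worst = 0;
    for (const auto& words : wordsPerBank) {
        worst = std::max(worst, words.size());
    }
    return worst;
}
```

For a 48-entry table of 4-byte values, entries 0 to 15 share banks with entries 32 to 47 under this model. A warp in which one thread reads entry 0 and another reads entry 32 therefore needs two transactions, while a warp in which every thread reads entry 0 needs one.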

tajiknomi · January 10, 2019, 1:03pm

In my case, more than one thread within the warp accesses different values in the same bank.

Got it.
Thank you :)
