Constant memory vs shared memory LUT

Hello,
I have a question concerning the choice of memory for a static LUT. The LUT size is 8KB. Currently I have implemented the LUT in constant memory, as I thought this would be the fastest way to do it.

The code that uses the LUT is:

    #pragma unroll
    for (unsigned i = 0; i < NUM_SIGMA_GROUPS; ++i)
    {
        fP = 1.0f;

        #pragma unroll
        for (unsigned j = 0; j < NUM_CAMS; ++j)
        {
            fP = fP * ( CLASS_ERROR * Gamma_Prior[i]
                      + ONE_MINUS_CLASS_ERROR
                      * fabs(P_Gamma[i][j] - LocalBlock[uiFromIDX + j * uiStride]) );
        }
    }

P_Gamma and Gamma_Prior are float LUTs; P_Gamma is the LUT of interest here. The runtime of my kernel is about 33 ms. If I replace the array access of P_Gamma with a constant value like 1 (all entries in the LUT are 0 or 1), the runtime drops to 12 ms. Therefore I assume that the access to P_Gamma is really slow. Now my question: can I expect shared memory to be faster than my current use of constant memory?

Thanks
Christoph

As discussed here previously, constant memory is not optimal unless all threads are accessing the same address. Your best bet for this kind of look-up table is to use a 1D texture bound to global memory, accessed via tex1Dfetch().
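For illustration, a minimal (untested) sketch of that approach with the texture reference API; texPGamma, d_PGamma, and the 2048-float size are placeholder names and assumptions, not from Christoph's code:

    // Texture reference for the LUT; must be declared at file scope.
    texture<float, 1, cudaReadModeElementType> texPGamma;

    __global__ void myKernel(float *out)
    {
        // Unfiltered fetch by integer index. Reads go through the texture
        // cache, so scattered accesses are cheaper than uncached global loads.
        float p = tex1Dfetch(texPGamma, threadIdx.x);
        out[threadIdx.x] = p;
    }

    // Host side: bind the texture to the device copy of the LUT once.
    void bindLUT(const float *d_PGamma)
    {
        cudaBindTexture(NULL, texPGamma, d_PGamma, 2048 * sizeof(float));
    }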

Constant memory is optimized for broadcast access, where all the threads are reading the same value, so it is not well suited to lookup tables in which different threads read different entries.
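To make the broadcast point concrete, a rough sketch of the two access patterns (c_lut is a placeholder name):

    __constant__ float c_lut[2048];   // 8KB of floats in constant memory

    __global__ void accessPatterns(float *out, int i)
    {
        // Broadcast: every thread in the warp reads the same address,
        // so a single fetch serves the whole warp. This is the fast case.
        float a = c_lut[i];

        // Divergent: threads read different addresses, and the hardware
        // serializes the reads one address at a time. This is the slow case.
        float b = c_lut[threadIdx.x];

        out[threadIdx.x] = a + b;
    }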

Shared memory is a better fit for LUT usage, though 8K is awfully big (especially if you are using shared memory for anything else).

If the data really must reside in a LUT, I would look to texture: if you specify the “linear” filter mode, TEX can perform the interpolation that your kernel appears to be doing in code. (But note, TEX interpolation is limited to 9 bits of precision.)
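If interpolation is actually wanted, the setup looks roughly like this (untested sketch; texLUT and h_lut are placeholder names, and the LUT has to live in a cudaArray for filtering to work):

    texture<float, 1, cudaReadModeElementType> texLUT;

    void setupFilteredLUT(const float *h_lut)
    {
        cudaArray *lutArray;
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaMallocArray(&lutArray, &desc, 2048);
        cudaMemcpyToArray(lutArray, 0, 0, h_lut, 2048 * sizeof(float),
                          cudaMemcpyHostToDevice);
        texLUT.filterMode = cudaFilterModeLinear;   // enable hardware interpolation
        cudaBindTextureToArray(texLUT, lutArray, desc);
    }

    // In the kernel, a fetch at a fractional coordinate returns a value
    // interpolated between the two neighboring entries:
    //     float v = tex1D(texLUT, x + 0.5f);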

But if the data can be computed, you might be surprised at how efficient it is to just compute the value. LUT-based optimizations that work well on CPUs have mixed results on GPUs.

Hello,
thanks for the answers. Unfortunately it is not possible to compute the lookup entries on the fly. What I am doing in the code is calculating group-specific probabilities, where each P_Gamma entry is 0 or 1 and specifies whether the code has to use the probability or the counter-probability.
What I would like to mention here is that the table lookups are processed sequentially in the two for loops, so all threads should read the same table entry at the same time. That was at least my intention when writing the code.
I have a __syncthreads() in front of the loops, but this makes no difference in runtime, as the threads do not diverge.
It is not clear to me whether the code should be fast in my case, where all threads read the same constant memory location at a time. Does somebody have an answer to this?
If this is not the case, I will try 1D textures.

Thanks
Christoph

A broadcast load from constant memory is reasonably efficient, but it is not completely without cost, especially for values not already in the constant cache. In particular, it looks like you hit every value from the LUT in each block, with no reuse within the thread. I think you would do better to have the threads load the LUT into shared memory cooperatively, then read the LUT from shared memory. Since 8KB is rather large, you might want to divide the calculation into multiple loops so that you can load just part of the table on each pass and reuse the shared memory on each subsequent pass.
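A sketch of the cooperative load, assuming the table is stored as a flat array of NUM_SIGMA_GROUPS * NUM_CAMS floats in global memory (g_PGamma and the sizes are placeholders):

    #define NUM_SIGMA_GROUPS 64   // placeholder sizes; together they give
    #define NUM_CAMS         32   // 2048 floats = 8KB
    #define LUT_SIZE (NUM_SIGMA_GROUPS * NUM_CAMS)

    __global__ void myKernel(const float *g_PGamma /*, ... */)
    {
        __shared__ float s_PGamma[LUT_SIZE];

        // Each thread loads a strided slice of the table; together the
        // block stages the whole LUT in shared memory exactly once.
        for (unsigned idx = threadIdx.x; idx < LUT_SIZE; idx += blockDim.x)
            s_PGamma[idx] = g_PGamma[idx];
        __syncthreads();   // the table must be complete before any thread reads it

        // ... the original loops, now reading s_PGamma[i * NUM_CAMS + j] ...
    }

If 8KB of shared memory is too much, the same loop can stage the table in chunks: load one chunk, __syncthreads(), process the i values it covers, __syncthreads(), then reuse the buffer for the next chunk.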

Also, you should consider the calculation that populates the LUT. If each entry requires less than a few dozen operations to calculate (and doesn’t require too much additional data), then you might want to calculate the LUT values instead of loading them from memory. Again, you would give each LUT entry to a thread to calculate, deposit the value into shared memory, and then proceed with the rest of your calculation, using the LUT values from shared memory.
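As a sketch of that pattern, with computeEntry() standing in for whatever formula generates the table (same placeholder sizes as above):

    __device__ float computeEntry(unsigned idx)
    {
        // ... whatever calculation produces entry idx of the table ...
        return 0.0f;
    }

    __global__ void myKernel(/* ... */)
    {
        __shared__ float s_lut[LUT_SIZE];

        // Each thread computes a strided slice of the entries directly,
        // instead of loading them from global memory.
        for (unsigned idx = threadIdx.x; idx < LUT_SIZE; idx += blockDim.x)
            s_lut[idx] = computeEntry(idx);
        __syncthreads();

        // ... the rest of the calculation reads s_lut ...
    }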

-rpl