Hello,

I have a question concerning the choice of memory for a static LUT. The LUT is 8 KB in size. I have currently implemented it in constant memory, as I thought this would be the fastest option.

The code that uses the LUT is:

```
#pragma unroll
for (unsigned i = 0; i < NUM_SIGMA_GROUPS; ++i)
{
    fP = 1.0f;

    // Accumulate the product over all cameras for sigma group i.
    #pragma unroll
    for (unsigned j = 0; j < NUM_CAMS; ++j)
    {
        fP = fP * ( CLASS_ERROR * Gamma_Prior[i]
                  + ONE_MINUS_CLASS_ERROR *
                    fabs(P_Gamma[i][j] - LocalBlock[uiFromIDX + j*uiStride]) );
    }
    // ... (rest of the outer loop body not shown)
}
```

P_Gamma and Gamma_Prior are float LUTs; P_Gamma is the one of interest here. The runtime of my kernel is about 33 ms. If I replace the array access to P_Gamma with a constant value such as 1 (all entries in the LUT are either 0 or 1), the runtime drops to 12 ms. I therefore assume that accessing P_Gamma is really slow. Now my question: can I expect shared memory to be faster than my current use of constant memory?
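To make the question concrete, the shared-memory variant I have in mind would look roughly like the sketch below. It is not my real kernel: the sizes (256 x 8 floats = 8 KB), the names `lutSharedKernel`, `runSharedLutDemo`, `lutGlobal` and `sP_Gamma`, the thread-dependent row index and the simplified product are all assumptions made for illustration. Each block first copies the LUT from global memory into shared memory, then every thread reads from the shared copy.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes: 256 * 8 floats = 8 KB, matching the LUT size above.
#define NUM_SIGMA_GROUPS 256
#define NUM_CAMS 8
#define LUT_ELEMS (NUM_SIGMA_GROUPS * NUM_CAMS)

__global__ void lutSharedKernel(const float* lutGlobal, const float* in,
                                float* out, int n)
{
    // Per-block copy of the LUT (8 KB of static shared memory).
    __shared__ float sP_Gamma[NUM_SIGMA_GROUPS][NUM_CAMS];

    // Cooperative, strided copy: all threads of the block take part.
    float* dst = &sP_Gamma[0][0];
    for (int k = threadIdx.x; k < LUT_ELEMS; k += blockDim.x)
        dst[k] = lutGlobal[k];
    __syncthreads();  // the copy must be complete before any thread reads it

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // bounds check only after the barrier

    // Assumption for this sketch: the LUT row depends on the thread index.
    int i = idx % NUM_SIGMA_GROUPS;
    float fP = 1.0f;
    #pragma unroll
    for (int j = 0; j < NUM_CAMS; ++j)
        fP *= fabsf(sP_Gamma[i][j] - in[idx]);
    out[idx] = fP;
}

// Host helper (hypothetical): fills a 0/1 LUT, runs the kernel, returns the output.
std::vector<float> runSharedLutDemo(int n)
{
    std::vector<float> hLut(LUT_ELEMS), hIn(n, 0.25f), hOut(n, 0.0f);
    for (int i = 0; i < NUM_SIGMA_GROUPS; ++i)
        for (int j = 0; j < NUM_CAMS; ++j)
            hLut[i * NUM_CAMS + j] = float((i + j) % 2);  // entries are 0 or 1

    float *dLut = nullptr, *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dLut, LUT_ELEMS * sizeof(float));
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dLut, hLut.data(), LUT_ELEMS * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    lutSharedKernel<<<blocks, threads>>>(dLut, dIn, dOut, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dLut);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}
```

Calling `runSharedLutDemo(1024)` from a `main` and timing the kernel against a `__constant__` version of the same table would be one way to compare the two. The per-block copy is extra work, so I am not sure whether it pays off for a table of this size.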

Thanks

Christoph