I need to have all threads access a common 4D lookup table. Access is random from all threads. Devices are multiple Tesla K10s.
The LUT data is normalized floating point, with fewer than 50 nodes per dimension. I will do the interpolation manually in CUDA code; I already have a working host version.
I have found that, at least for 1D and 2D integer data, simultaneous random access is much faster when the table is stored in texture memory. Table dimensions are small, typically 33 nodes per dimension.
(The table is a float array, LutABCD, in host memory.)
Question - Is there a best way to fold the table into a lower-dimensional array? Specifically, is
float simLUT_AB_CD   better than
float simLUT_ABC_D   or even
float simLUT_ABCD [33*33*33*33], or anything else?