4D float array storage

I need to have all threads access a common 4D lookup table. Access is random from all threads. Devices are multiple Tesla K10s.
The LUT data is normalized floating point, with fewer than 50 nodes in each dimension. I will be doing the interpolation manually in CUDA code. I already have a working host version.

I have found that, at least for 1D and 2D integer data, simultaneous access is much faster when texture memory is used for storage. Table dimensions are small, typically 33.
( float LutABCD[33][33][33][33] in host memory)

Question - Is there a best way to fold the table into a lower-dimensional array? Specifically, is
float simLUT_AB_CD [33*33] [33*33] better than
float simlut_ABC_D [33] [33*33*33] or even
float simLUT_ABCD [33*33*33*33] or anything else?

To my knowledge, and someone correct me if I’m wrong, it is faster to deal with 1D arrays always… so my suggestion is to flatten it to 1 dimension, like so: https://plus.google.com/104038699355103594931/posts/3wipTEpMUun

You can define a macro so that it’s not a complete pain to recalculate your indexes as talonmies’ response shows: http://stackoverflow.com/questions/5631115/2d-array-on-cuda

Using the first link, it should be straight-forward to extend the macro above into 4 dimensions.
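For illustration, a minimal sketch of what the 4D version might look like, assuming row-major order with D as the fastest-varying index and 33 nodes in every dimension as in the question. The names `LUT_N`, `IDX4` and `lut_index` are made up here and do not come from either link. It is plain C++ so it can be checked on the host; the same arithmetic should carry over unchanged into a kernel.

```cpp
#include <cassert>
#include <cstddef>

// Nodes per dimension (33 in the question; assumed equal for all four axes).
constexpr std::size_t LUT_N = 33;

// Row-major flattening in Horner form: d varies fastest, a slowest.
#define IDX4(a, b, c, d) \
    ((((a) * LUT_N + (b)) * LUT_N + (c)) * LUT_N + (d))

// Bounds-checked wrapper, intended for host-side debugging only.
inline std::size_t lut_index(std::size_t a, std::size_t b,
                             std::size_t c, std::size_t d)
{
    assert(a < LUT_N && b < LUT_N && c < LUT_N && d < LUT_N);
    return IDX4(a, b, c, d);
}
```

With this layout, `IDX4(a, b, c, d)` addresses the same element as `LutABCD[a][b][c][d]`. In plain linear memory the [33*33][33*33] and [33][33*33*33] foldings produce the same offset, so there they differ only in how the index is written.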

La vaca not so loca!
Thanks for both. That’s what I was planning to do.
The macro does make the code readable.
The reduce operations comment won’t be needed since the dimensions are known at compile time.

If no one corrects you, I’m going to assume you are right about 1D flattening being faster.
It makes sense. Thanks.

Why don’t you consider using CUDA’s float4? It is a struct defined as

struct __device_builtin__ __builtin_align__(16) float4
{
    float x, y, z, w;
};

and the declaration would read as

float4 simLUT[33];

__builtin_align__(16) asks the compiler to place the struct (or an array of such structs) on a 16-byte boundary, so that each element can be fetched with a single 128-bit load, which helps coalescing.

It’s probably not a problem, but your randomly-accessed 4D LUT is ~4.5 MB (33^4 floats at 4 bytes each), so it can’t fit entirely in shared memory or the L2 cache.
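The size estimate can be checked with a few lines of host code. This is only a sketch, and `lut_bytes` is an illustrative name:

```cpp
#include <cstddef>

// Footprint in bytes of a 4D float LUT with n nodes per dimension.
constexpr std::size_t lut_bytes(std::size_t n)
{
    return n * n * n * n * sizeof(float);
}

// 33^4 = 1,185,921 floats at 4 bytes each = 4,743,684 bytes (about 4.5 MiB).
static_assert(lut_bytes(33) == 4743684, "33^4 floats at 4 bytes each");
```

The same function gives the footprint for any other node count, for example `lut_bytes(16)` is 262,144 bytes.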

If LUT throughput is the bottleneck, you might consider finding a more compact way of representing the LUT so that it fits in the texture cache, shared memory or L2.