Odd observation about ordering of __shared__ memory arrays

I have noticed this in a couple of kernels where I have a number of shared memory arrays of compile-time-constant size.

For example, something like this:

__shared__ int a,b;
__shared__ float2 d;
__shared__ int f;
__shared__ int idx_locs[131];
__shared__ float2 val_locs[131];

There seems to be an ‘optimal’ permutation of those declarations, which can make a modest difference in running time.

The order also directly affects the amount of shared memory used for the kernel (visible in the compiler's ptxas verbose output), even though the amount of memory declared does not change.

ptxas info    : Used 32 registers, 9120 bytes smem, 392 bytes cmem[0], 4 bytes cmem[2]

versus this, when the only change is re-ordering the shared memory declarations from the example above:

ptxas info    : Used 32 registers, 9112 bytes smem, 392 bytes cmem[0], 4 bytes cmem[2]
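For reference, the resource-usage lines above are what ptxas prints when asked for verbose output; one way to request it (the file name here is just a placeholder) is:

nvcc -Xptxas -v -c my_kernel.cu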

Those arrays are essentially ‘read-only’ after an initial __syncthreads() that follows filling in the values. Some values are read often, while others only a few times.
In the above example, the array ‘idx_locs’ is read the most frequently by all threads in a block, but not in a coalesced or ‘bank-conflict-free’ pattern.

When I change the ordering to another permutation like this:

__shared__ int idx_locs[131];
__shared__ int a,b;
__shared__ float2 d;
__shared__ int f;
__shared__ float2 val_locs[131];

I get a consistent, modest ~1% increase in performance across a wide range of inputs. This also appears to be correlated with the amount of shared memory used, with higher amounts associated with slightly better performance.
I am guessing the compiler is aligning or padding some of that memory based on declaration order, but that is just a guess. Not a huge impact on performance, but an impact nonetheless.
Not a big issue, just curious as to what may be going on during compilation.

Keep in mind that all data on the GPU must be naturally aligned, and that built-in vector types like ‘float2’ therefore require increased alignment (8 bytes for ‘float2’), which may force the compiler to insert padding and thus slightly increase total storage requirements. To minimize total storage, try declaring data in order of decreasing sizeof() of the base type: here, the ‘float2’ variables first, followed by the ‘int’ variables.
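To make that concrete, here is a minimal sketch (the kernel name is just a placeholder, and the body is elided) using the declarations from the original post, reordered by decreasing element size. Note that idx_locs occupies 131 * 4 = 524 bytes, which is not a multiple of 8, so a ‘float2’ object placed directly after it would need 4 bytes of padding to satisfy its alignment; putting the 8-byte-aligned data first avoids that.

__global__ void kernel_size_ordered(/* ... */)
{
    // 8-byte types first: float2 requires 8-byte natural alignment
    __shared__ float2 val_locs[131];
    __shared__ float2 d;
    // 4-byte types last: the ints pack together with no padding needed
    __shared__ int idx_locs[131];
    __shared__ int a, b;
    __shared__ int f;

    // ... fill the arrays, __syncthreads(), then read ...
}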

Where caches are involved, it would also make sense to group “hot” storage locations together, followed by “cold” storage locations. This may conflict with the goal of minimizing storage.
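And a sketch of the other trade-off, keeping the frequently read idx_locs from the original post at the front (again with a placeholder kernel name); whether the possible padding cost is worth it is something to measure rather than assume:

__global__ void kernel_hot_first(/* ... */)
{
    // 'hot' array first: read most frequently by all threads in the block
    __shared__ int idx_locs[131];      // 524 bytes, not a multiple of 8
    // the compiler may insert 4 bytes of padding here to align the float2 data
    __shared__ float2 val_locs[131];
    __shared__ float2 d;
    // 'cold' scalars last: read only a few times
    __shared__ int a, b;
    __shared__ int f;

    // ... kernel body ...
}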