I've noticed this in a couple of kernels where I have a number of shared memory arrays of constant size.
For example, something like this:
__shared__ int a,b;
__shared__ float2 d;
__shared__ int f;
__shared__ int idx_locs[131];
__shared__ float2 val_locs[131];
There seems to be an ‘optimal’ permutation of those arrays which can make a modest difference in running times.
The order also directly affects the amount of shared memory used by the kernel, even though the amount of memory declared does not change. You can see this metric by having nvcc pass -v through to ptxas.
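For example (kernel.cu is just a placeholder file name here):

nvcc -Xptxas -v -c kernel.cu

which prints a per-kernel resource summary like: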
ptxas info : Used 32 registers, 9120 bytes smem, 392 bytes cmem[0], 4 bytes cmem[2]
or this, when the only change is re-ordering the shared memory declarations:
ptxas info : Used 32 registers, 9112 bytes smem, 392 bytes cmem[0], 4 bytes cmem[2]
Those arrays are essentially read-only after an initial fill followed by a __syncthreads(); some values are read often, while others only a few times.
In the above example the array idx_locs is read the most frequently by all threads in a block, but not in a coalesced or bank-conflict-free pattern.
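A rough sketch of the kind of access pattern I mean (the fill logic and indexing here are hypothetical placeholders, not my actual kernel):

__global__ void sketch(const int *in, float *out, int n)
{
    __shared__ int idx_locs[131];
    __shared__ float2 val_locs[131];

    // Fill the shared arrays once, then synchronize (placeholder fill).
    for (int i = threadIdx.x; i < 131; i += blockDim.x) {
        idx_locs[i] = (i * 7) % 131;
        val_locs[i] = make_float2((float)i, 0.0f);
    }
    __syncthreads();

    // Read-only from here on. The lookup index is data-dependent, so
    // threads in a warp can hit the same bank and serialize.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)  // assumes in[] holds non-negative values
        out[tid] = val_locs[idx_locs[in[tid] % 131]].x;
}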
When I change the ordering to some random permutation, like this:
__shared__ int idx_locs[131];
__shared__ int a,b;
__shared__ float2 d;
__shared__ int f;
__shared__ float2 val_locs[131];
I get a consistent, modest 1% increase in performance across a wide range of inputs. This also seems to be correlated with the amount of shared memory used, with higher amounts associated with slightly higher performance.
I am guessing the compiler is aligning or padding the shared memory based on declaration order, but that is just a guess. Not a huge impact on performance, but an impact nonetheless.
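If ptxas does lay out __shared__ variables in declaration order with natural alignment (an assumption on my part, not something I have verified), the arithmetic works out to an 8-byte difference, the same gap the two ptxas reports show:

// Assumed in-order layout; float2 needs 8-byte alignment, int needs 4.
// First ordering:
//   a@0, b@4, d@8..15, f@16..19, idx_locs@20..543 (524 bytes),
//   val_locs@544..1591 (544 is a multiple of 8, so no padding) -> 1592 bytes
// Second ordering:
//   idx_locs@0..523, a@524, b@528, d padded from 532 to 536..543,
//   f@544..547, val_locs padded from 548 to 552..1599 -> 1600 bytes
// The 8 bytes of padding would account for the 9120 vs 9112 difference
// above (the rest of the smem total comes from other usage in the real kernel).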
Not a big issue; I'm just curious what may be going on during compilation.