Confirm that dynamically allocated __shared__ memory is just as fast as the statically allocated variety?

If I dynamically allocated shared memory in a kernel, and do all the type alignment correctly (I have both int and long long int __shared__ arrays, so I will choose a 32-bit type and be sure to have the long long int arrays point to even numbered indices of the base __extern__ allocation), can anyone just confirm for me that this will produce the same performance as statically allocated __shared__ memory? I have realized that I need to make one of my kernels launchable as 1 x 1024, 2 x 512, 4 x 256, or even 8 x 128 per streaming multiprocessor for best results. The current setup statically allocates 24 - 36 kB of __shared__ per block for up to 1024 atoms, so reducing this to accommodate the thread block size is critical for staging multiple blocks on the same SM.

And, the syntax for any one CUDA unit is simply to declare

extern __shared__ int shmem_array[];

__global__ __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
void myFirstKernel(const int stride) {
  int* pointer_a = shmem_array[0];
  int* pointer_b = shmem_array[stride];
  int* pointer_c = shmem_array[2 * stride];

__global__ __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
void mySecondKernel(const int stride) {
  long long int* lptr_a = shmem_array[0];
  long long int* lptr_b = shmem_array[2 * stride];
  int* regular_ptr_c = shmem_array[4 * stride];

It seems, from typing that out, that even if the baseline performance of accesses to a statically allocated array and a dynamically allocated array might be similar, it may require more registers to keep track of the pointers into the dynamically allocated array, or (if I were to store pointer_a, pointer_b, and pointer_c as __shared__ values themselves) additional accesses to __shared__ just to get the pointers to the __shared__ memory I want to access. I can compile more variants of the kernel, for the typical “large block” case and then a “small block” case at the other end of the spectrum. That would cover the vast majority of what’s relevant.

Wow, this gets quite involved… and every choice carries a degree of bookkeeping to manage.