cudaDeviceSetSharedMemConfig not working for Tesla K40m

Hi. I have the following shared-memory transpose kernel, which suffers from bank conflicts.

#define TILE_DIM 32
#define BLOCK_ROWS 32

__global__ void transposeCoalesced(float *odata, float *idata, const int nx, const int ny)
{
  __shared__ float tile[TILE_DIM][TILE_DIM];  // no padding, so column accesses map to the same bank

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)*width + x];

  __syncthreads();

  x = blockIdx.y * TILE_DIM + threadIdx.x;  // transpose block offset
  y = blockIdx.x * TILE_DIM + threadIdx.y;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
     odata[(y+j)*width + x] = tile[threadIdx.x][threadIdx.y + j];  // column read of tile -> bank conflicts
}

The kernel is launched for an 8192x8192 matrix with a (256,256,1) grid and (32,32,1) threads per block. With a shared memory bank size of 4 bytes, this should ideally produce a 32-way bank conflict on the shared loads from the tile while writing to global memory. With a bank size of 8 bytes, that should be halved. However, I observe that changing the shared memory bank size has no effect: the number of shared load transactions for a bank width of 4 bytes is the same as for a bank width of 8 bytes. I'm stumped here, since it's documented that the shared memory bank size is configurable on devices of compute capability 3.x. Can anyone help?

What I observe is that the code produces a shared_load_transactions_per_request metric of 16 on a Kepler device regardless of bank mode, whereas the same code produces a metric value of 32 on e.g. a cc7.0 device.

To me, the interesting thing here is that the Kepler device appears to provoke only a 16-way bank conflict, as opposed to the 32-way bank conflict that is evident on the cc7.0 device, regardless of bank mode.

My suggestion would be to read the description of shared memory in cc 3.x carefully:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-3-0

Note two things in particular:

  1. It is stated that each bank has a bandwidth of 64 bits per cycle, before any description of 4-byte (32-bit) vs. 8-byte (64-bit) bank mode is undertaken.

  2. Note that the description of 32-bit mode (4-byte mode) differs noticeably from the description of shared memory for e.g. cc 5.x:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-5-x