Hi. I have the following shared-memory transpose kernel, which suffers from bank conflicts.
#define TILE_DIM 32
#define BLOCK_ROWS 32

__global__ void transposeCoalesced(float *odata, float *idata, const int nx, const int ny)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    // Coalesced load of a TILE_DIM x TILE_DIM tile into shared memory.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x; // transpose block offset
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    // Coalesced store to global memory; the shared read walks a column of
    // the tile, so with 4-byte banks a warp's 32 threads hit the same bank.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}
The kernel is launched for an 8192x8192 matrix with a (256,256,1) grid of (32,32,1) thread blocks. With a shared-memory bank width of 4 bytes, the column-wise shared loads during the store phase should ideally produce a 32-way bank conflict; with an 8-byte bank width, that should be halved to 16-way. However, I observe that changing the shared-memory bank size has no effect: the number of shared load transactions reported for the 4-byte bank width is identical to that for the 8-byte bank width. I'm stumped, since on devices of compute capability 3.x the shared memory bank size is known to be configurable (via cudaDeviceSetSharedMemBankSize). Can anyone help?