Shared memory without bank conflicts slower than with bank conflicts

I am following an example of a simple matrix transpose. In the code below, shmemTransposeKernel has bank conflicts (confirmed with the performance analysis in Visual Studio), yet it is consistently faster (for matrix sizes 2^9, 2^10, 2^11) than optimalTransposeKernel, where I added padding to the shared memory to avoid the conflicts.

The launch configuration is blockSize (64, 16) and gridSize (n/64, n/64), where n is the side length of the square matrix.
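For reference, that launch configuration would look roughly like this (the device pointer names d_input and d_output are assumptions, not from the original code):

```cuda
dim3 blockSize(64, 16);          // 64 threads in x, 16 in y; each thread moves 4 rows
dim3 gridSize(n / 64, n / 64);   // one block per 64x64 tile of the matrix
shmemTransposeKernel<<<gridSize, blockSize>>>(d_input, d_output, n);
```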

Both kernels are also much slower than a naive CPU transpose, or than the GPU without shared memory…

The consumed time:
Size 512 naive CPU: 0.002048 ms
Size 512 GPU memcpy: 0.025600 ms
Size 512 naive GPU: 0.109568 ms
Size 512 shmem GPU: 0.250880 ms
Size 512 optimal GPU: 0.959488 ms

Size 1024 naive CPU: 0.002048 ms
Size 1024 GPU memcpy: 0.091136 ms
Size 1024 naive GPU: 0.400384 ms
Size 1024 shmem GPU: 1.121280 ms
Size 1024 optimal GPU: 3.559424 ms

Size 2048 naive CPU: 0.002048 ms
Size 2048 GPU memcpy: 0.348160 ms
Size 2048 naive GPU: 1.708032 ms
Size 2048 shmem GPU: 3.823616 ms
Size 2048 optimal GPU: 12.526592 ms

I am very confused about why this is happening… My GPU is a GeForce GTX 1050 Ti.

__global__
void shmemTransposeKernel(const float *input, float *output, int n) {

	__shared__ float data[64][64]; // bank conflict

	const int i = threadIdx.x + 64 * blockIdx.x;
	int j = 4 * threadIdx.y + 64 * blockIdx.y;
	const int end_j = j + 4;
	// copy data to shared memory
	for (; j < end_j; j++)
		data[threadIdx.y * 4 + 4 - (end_j - j)][threadIdx.x] = input[j * n + i];
	__syncthreads();
	// copy data from shared memory to output
	int outi = threadIdx.x + 64 * blockIdx.y;
	int outj = 4 * threadIdx.y + 64 * blockIdx.x;
	int end_outj = outj + 4;
	for (; outj < end_outj; outj++)
		output[outj * n + outi] = data[threadIdx.x][threadIdx.y * 4 + 4 - (end_outj - outj)];
}

__global__
void optimalTransposeKernel(const float *input, float *output, int n) {

	__shared__ float data[64][65]; // add padding to avoid bank conflict

	const int i = threadIdx.x + 64 * blockIdx.x;
	int j = 4 * threadIdx.y + 64 * blockIdx.y;
	const int end_j = j + 4;
	// copy data to shared memory
	for (; j < end_j; j++)
		data[threadIdx.y * 4 + 4 - (end_j - j)][threadIdx.x] = input[j * n + i];
	__syncthreads();
	// copy data from shared memory to output
	int outi = threadIdx.x + 64 * blockIdx.y;
	int outj = 4 * threadIdx.y + 64 * blockIdx.x;
	int end_outj = outj + 4;
	for (; outj < end_outj; outj++)
		output[outj * n + outi] = data[threadIdx.x][threadIdx.y * 4 + 4 - (end_outj - outj)];
}
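For what it's worth, here is how I am timing the kernels, using CUDA events around the launch (a sketch; d_input and d_output are assumed to be already-allocated device buffers):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
shmemTransposeKernel<<<gridSize, blockSize>>>(d_input, d_output, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);          // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
```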

Are you building a debug project?
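For context: a Visual Studio Debug configuration passes the device-debug flag (-G) to nvcc, which disables device-side optimization and can skew kernel timings badly. Benchmark a Release build, or compile by hand with optimization; an illustrative command line:

```shell
# Debug build: device debug info, device optimizations off (not for benchmarking)
nvcc -G -o transpose_debug transpose.cu
# Release build: optimized device code, use this when timing kernels
nvcc -O3 -o transpose transpose.cu
```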

@Robert_crovella, you are right. That resolved my issue…
Also, the CPU elapsed time was a bug: I was not measuring the CPU time properly.