Shared memory without bank conflicts slower than with bank conflicts

I am following an example of a simple matrix transpose. In the code below, shmemTransposeKernel has bank conflicts (confirmed with the performance analysis in Visual Studio), yet it is consistently faster (for matrix sizes 2^9, 2^10, 2^11) than optimalTransposeKernel, where I added padding to the shared memory to avoid the conflicts.

The launch configuration is blockSize (64, 16) and gridSize (n/64, n/64), where n is the side length of the square matrix.
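For reference, that launch configuration would look roughly like this (the device pointer names d_input and d_output are assumptions, not from the original code):

```cuda
dim3 blockSize(64, 16);          // 64 threads in x, 16 in y; each thread moves 4 rows
dim3 gridSize(n / 64, n / 64);   // one block per 64x64 tile of the matrix
shmemTransposeKernel<<<gridSize, blockSize>>>(d_input, d_output, n);
```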

Both kernels are also much slower than a naive CPU transpose, or than the GPU without shared memory…

The consumed time:
Size 512 naive CPU: 0.002048 ms
Size 512 GPU memcpy: 0.025600 ms
Size 512 naive GPU: 0.109568 ms
Size 512 shmem GPU: 0.250880 ms
Size 512 optimal GPU: 0.959488 ms

Size 1024 naive CPU: 0.002048 ms
Size 1024 GPU memcpy: 0.091136 ms
Size 1024 naive GPU: 0.400384 ms
Size 1024 shmem GPU: 1.121280 ms
Size 1024 optimal GPU: 3.559424 ms

Size 2048 naive CPU: 0.002048 ms
Size 2048 GPU memcpy: 0.348160 ms
Size 2048 naive GPU: 1.708032 ms
Size 2048 shmem GPU: 3.823616 ms
Size 2048 optimal GPU: 12.526592 ms

I am very confused about why this is happening… My GPU is a GeForce GTX 1050 Ti.

__global__
void shmemTransposeKernel(const float *input, float *output, int n) {

	__shared__ float data[64][64]; // bank conflict

	const int i = threadIdx.x + 64 * blockIdx.x;
	int j = 4 * threadIdx.y + 64 * blockIdx.y;
	const int end_j = j + 4;
	// copy data to shared memory
	for (; j < end_j; j++)
		data[threadIdx.y * 4 + 4 - (end_j - j)][threadIdx.x] = input[j * n + i];
	__syncthreads();
	// copy data from shared memory to output
	int outi = threadIdx.x + 64 * blockIdx.y;
	int outj = 4 * threadIdx.y + 64 * blockIdx.x;
	int end_outj = outj + 4;
	for (; outj < end_outj; outj++)
		output[outj * n + outi] = data[threadIdx.x][threadIdx.y * 4 + 4 - (end_outj - outj)];
}

__global__
void optimalTransposeKernel(const float *input, float *output, int n) {

	__shared__ float data[64][65]; // add padding to avoid bank conflict

	const int i = threadIdx.x + 64 * blockIdx.x;
	int j = 4 * threadIdx.y + 64 * blockIdx.y;
	const int end_j = j + 4;
	// copy data to shared memory
	for (; j < end_j; j++)
		data[threadIdx.y * 4 + 4 - (end_j - j)][threadIdx.x] = input[j * n + i];
	__syncthreads();
	// copy data from shared memory to output
	int outi = threadIdx.x + 64 * blockIdx.y;
	int outj = 4 * threadIdx.y + 64 * blockIdx.x;
	int end_outj = outj + 4;
	for (; outj < end_outj; outj++)
		output[outj * n + outi] = data[threadIdx.x][threadIdx.y * 4 + 4 - (end_outj - outj)];
}
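For what it's worth, here is how I am timing the kernels, using CUDA events around the launch (a sketch; d_input and d_output are assumed to be already-allocated device buffers):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
shmemTransposeKernel<<<gridSize, blockSize>>>(d_input, d_output, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);          // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
```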

Are you building a debug project?
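For context: a Visual Studio Debug configuration passes the device-debug flag (-G) to nvcc, which disables device-side optimization and can skew kernel timings badly. Benchmark a Release build, or compile by hand with optimization; an illustrative command line:

```shell
# Debug build: device debug info, device optimizations off (not for benchmarking)
nvcc -G -o transpose_debug transpose.cu
# Release build: optimized device code, use this when timing kernels
nvcc -O3 -o transpose transpose.cu
```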

@Robert_crovella, you are right. That resolved my issue…
Also, the CPU elapsed time was a bug: I was not measuring the CPU time properly.