Matrix Multiplication and Bank Conflicts (code included)

Hey guys.

I am going deeper into CUDA and I read online that the matrix multiplication example with shared memory has bank conflicts, and that one cheap way of solving them is to pad the shared memory array with an extra column:


I did that and expected better results, BUT I got worse results.

Below is my code. It compiles as a single .cu file, but you need to add your own timer. Could anyone confirm my results and tell me why I am getting worse results when everyone online claims the opposite?

(I used #pragma unroll 1 so that the compiler does not unroll the loops, for a fairer comparison.)

My times using Bs[BLOCK_SIZE][BLOCK_SIZE]     : 1414 milliseconds

               Bs[BLOCK_SIZE][BLOCK_SIZE + 1] : 1446 milliseconds

If anyone could explain what I'm doing wrong and confirm my results, I would really appreciate it.

Thanks a lot!

#include <iostream>

//#include "cuda_time.h"

using namespace std;

#define BLOCK_SIZE 16
#define N 2048

__global__ void kernel2(int* A, int* B, int* C, int size)
{
	__shared__ int As[BLOCK_SIZE][BLOCK_SIZE];
	__shared__ int Bs[BLOCK_SIZE][BLOCK_SIZE + 1];     // <---   TRY WITH BLOCK_SIZE, and BLOCK_SIZE + 1.

	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int bx = blockIdx.x;
	int by = blockIdx.y;

	int row = by * BLOCK_SIZE + ty;
	int col = bx * BLOCK_SIZE + tx;

	int value = 0;

#pragma unroll 1
	for(int m = 0; m < size / BLOCK_SIZE; m++)
	{
		As[ty][tx] = A[row * size + (m * BLOCK_SIZE + tx)];
		Bs[ty][tx] = B[col + (m * BLOCK_SIZE + ty) * size];

		__syncthreads();   // wait until the whole tile is loaded

#pragma unroll 1
		for(int k = 0; k < BLOCK_SIZE; k++)
			value += As[ty][k] * Bs[k][tx];

		__syncthreads();   // wait before the tile is overwritten
	}

	C[row * size + col] = value;
}

void go()
{
	int* A = new int[N * N];
	int* B = new int[N * N];
	int* C = new int[N * N];

	for(int i = 0; i < N * N; i++)
	{
		A[i] = i;
		B[i] = i;
		C[i] = i;
	}

	int* a;
	int* b;
	int* c;

	cudaMalloc((void**)&a, N * N * sizeof(int));
	cudaMalloc((void**)&b, N * N * sizeof(int));
	cudaMalloc((void**)&c, N * N * sizeof(int));

	cudaMemcpy(a, A, N * N * sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(b, B, N * N * sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(c, C, N * N * sizeof(int), cudaMemcpyHostToDevice);

	dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
	dim3 grid(N / BLOCK_SIZE, N / BLOCK_SIZE);

	//cuda_time ct;							<----  INSERT YOUR OWN TIMER CODE

	kernel2<<<grid, threads>>>(a, b, c, N);
	cudaThreadSynchronize();

	//cout << "TIME IS:\t" << ct.get_time() << endl;

	cudaMemcpy(C, c, N * N * sizeof(int), cudaMemcpyDeviceToHost);

	cudaFree(a);
	cudaFree(b);
	cudaFree(c);

	delete[] A;
	delete[] B;
	delete[] C;
}

int main()
{
	go();

	cout << "EXIT" << endl;

	return 0;
}

When I tried your code, I got 72 ms for the version with BLOCK_SIZE + 1, but 70 ms with BLOCK_SIZE (if I increased BLOCK_SIZE to 32, I got 66 ms).

If I compiled with -arch=sm_20, however, I got 147 ms with BLOCK_SIZE + 1 and 122 ms with BLOCK_SIZE. By removing #pragma unroll 1, the -arch=sm_20 version got as fast as the version without -arch=sm_20, i.e. 72 / 70 ms. Removing #pragma unroll 1 in the case without -arch=sm_20 did not make any difference. I looked at the PTX code, but the loop was not unrolled there for the case without -arch=sm_20 but with #pragma unroll 1. Why does the code take twice the time to run for -arch=sm_20 if #pragma unroll 1 is used? (I tested the code on a GeForce GTX 480 card with CUDA 3.2.)

Regarding the bank conflicts, I do not see them in the inner loop. The index of Bs[k][threadIdx.x] should mean that the data end up in different memory banks, compared to Bs[threadIdx.x][k], which would give bank conflicts. For As[threadIdx.y][k] the same value is used by all 16 threads, i.e. no bank conflict either. For Fermi cards, you have 2-way bank conflicts with BLOCK_SIZE=16, which is probably why it was faster for me with BLOCK_SIZE=32.

Oh, so are you suggesting that even with Bs[BLOCK_SIZE][BLOCK_SIZE] there are NO bank conflicts?

Yes, try changing the indices to Bs[tx][k] (will of course give invalid results), which generates bank conflicts. Then you can try and use Bs[BLOCK_SIZE][BLOCK_SIZE] or Bs[BLOCK_SIZE][BLOCK_SIZE+1] to see the difference.

Strictly speaking, new cards have 32 memory banks, which means that with BLOCK_SIZE=16, two threads will access the same memory bank, but not all of them.