Matrix Multiplication and Bank Conflicts (code included)

Hey guys.

I am going deeper into CUDA and I read online that the shared-memory matrix multiplication example has bank conflicts, and that one cheap way of fixing them is to pad the shared memory array with an extra column, i.e.

Bs[BLOCK_SIZE][BLOCK_SIZE + 1]

I did that and expected better results, BUT I got worse results.

Below is my code. It compiles as a single .cu file, but you need to add your own timer (one way of doing it with CUDA events is sketched after the listing). Could anyone confirm my results and tell me why I am getting worse results when everyone online claims otherwise?

(I used #pragma unroll 1 so that the compiler does not unroll the loops, to make the comparison fairer.)

My times:

    Bs[BLOCK_SIZE][BLOCK_SIZE]     : 1414 milliseconds
    Bs[BLOCK_SIZE][BLOCK_SIZE + 1] : 1446 milliseconds

If anyone could explain what I'm doing wrong, or confirm my results, I would really appreciate it.

Thanks a lot!

#include <iostream>

//#include "cuda_time.h"

using namespace std;

#define BLOCK_SIZE 16
#define N 2048

__global__ void kernel2(int* A, int* B, int* C, int size)
{
	__shared__ int As[BLOCK_SIZE][BLOCK_SIZE];
	__shared__ int Bs[BLOCK_SIZE][BLOCK_SIZE + 1];   // <--- try with BLOCK_SIZE and BLOCK_SIZE + 1

	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int bx = blockIdx.x;
	int by = blockIdx.y;

	int row = by * BLOCK_SIZE + ty;
	int col = bx * BLOCK_SIZE + tx;

	int value = 0;

#pragma unroll 1
	for (int m = 0; m < size / BLOCK_SIZE; m++)
	{
		// Load one BLOCK_SIZE x BLOCK_SIZE tile of A and of B into shared memory.
		As[ty][tx] = A[row * size + (m * BLOCK_SIZE + tx)];
		Bs[ty][tx] = B[(m * BLOCK_SIZE + ty) * size + col];   // row stride is size (multiply, not add)
		__syncthreads();

#pragma unroll 1
		for (int k = 0; k < BLOCK_SIZE; k++)
		{
			value += As[ty][k] * Bs[k][tx];
		}
		__syncthreads();
	}

	C[row * size + col] = value;
}

void go()
{
	int* A = new int[N * N];
	int* B = new int[N * N];
	int* C = new int[N * N];

	for (int i = 0; i < N * N; i++)
	{
		A[i] = i;
		B[i] = i;
		C[i] = i;
	}

	int* a;
	int* b;
	int* c;
	cudaMalloc((void**)&a, N * N * sizeof(int));
	cudaMalloc((void**)&b, N * N * sizeof(int));
	cudaMalloc((void**)&c, N * N * sizeof(int));

	cudaMemcpy(a, A, N * N * sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(b, B, N * N * sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(c, C, N * N * sizeof(int), cudaMemcpyHostToDevice);

	dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
	dim3 grid(N / BLOCK_SIZE, N / BLOCK_SIZE);

	//cuda_time ct;       <---- insert your own timer code here
	//ct.start();
	kernel2<<<grid, threads>>>(a, b, c, N);
	cudaThreadSynchronize();
	//ct.stop();
	//cout << "TIME IS:\t" << ct.get_time() << endl;

	// Clean up device and host memory.
	cudaFree(a);
	cudaFree(b);
	cudaFree(c);
	delete[] A;
	delete[] B;
	delete[] C;
}

int main()
{
	go();
	cout << "EXIT" << endl;
	cin.get();
	return 0;
}
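In case it helps anyone reproducing this: the numbers above came from my own cuda_time class, but the commented-out timer lines can be replaced with plain CUDA events, for example like this (a minimal sketch):

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	kernel2<<<grid, threads>>>(a, b, c, N);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);               // wait until the kernel has finished

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
	cout << "TIME IS:\t" << ms << endl;

	cudaEventDestroy(start);
	cudaEventDestroy(stop);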

When I tried your code, I got 72 ms for the version with BLOCK_SIZE + 1, but 70 ms with BLOCK_SIZE (if I increased BLOCK_SIZE to 32, I got 66 ms).

If I compiled with -arch=sm_20, however, I got 147 ms with BLOCK_SIZE + 1 and 122 ms with BLOCK_SIZE. After removing #pragma unroll 1, the -arch=sm_20 build became as fast as the build without -arch=sm_20, i.e. 72 / 70 ms. Removing #pragma unroll 1 from the build without -arch=sm_20 did not make any difference. I looked at the PTX code, and the loop was not unrolled there in the build without -arch=sm_20 but with #pragma unroll 1 either. Why does the code take twice as long to run with -arch=sm_20 if #pragma unroll 1 is used? (I tested the code on a GeForce GTX 480 with CUDA 3.2.)
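For completeness, the two builds I am comparing differ only in the architecture flag, roughly like this (matmul.cu is just a placeholder name for the single .cu file):

	nvcc matmul.cu -o matmul
	nvcc -arch=sm_20 matmul.cu -o matmul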

Regarding the bank conflicts, I do not see them in the inner loop. The indexing Bs[k][threadIdx.x] means the threads of a half-warp read consecutive words, which end up in different memory banks, unlike Bs[threadIdx.x][k], which would give bank conflicts. For As[threadIdx.y][k] the same value is read by all 16 threads of a half-warp, i.e. a broadcast, so no bank conflict there either. On Fermi cards you can get 2-way bank conflicts with BLOCK_SIZE=16, which is probably why it was faster for me with BLOCK_SIZE=32.

Oh, so are you suggesting that even with

Bs[BLOCK_SIZE][BLOCK_SIZE] there are NO bank conflicts?

Yes. Try changing the indices to Bs[tx][k] (which will of course give invalid results), which generates bank conflicts. Then you can try both Bs[BLOCK_SIZE][BLOCK_SIZE] and Bs[BLOCK_SIZE][BLOCK_SIZE + 1] to see the difference (there is a sketch of the two inner-loop variants at the end of this post).

Strictly speaking, new cards have 32 memory banks, which means that with BLOCK_SIZE=16, two threads will access the same memory bank, but not all of them.
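To make that concrete, here is a sketch of the two inner-loop variants, meant as drop-in replacements for the inner loop of kernel2 above (the second one is numerically wrong on purpose, it only exists to force column-wise shared-memory accesses):

	// Variant 1: what the kernel above does. For a fixed k, the threads of a
	// half-warp read Bs[k][0], Bs[k][1], ..., Bs[k][15], i.e. consecutive
	// words, which map to different banks -> no bank conflict.
	for (int k = 0; k < BLOCK_SIZE; k++)
		value += As[ty][k] * Bs[k][tx];

	// Variant 2: transposed indexing, for demonstration only (results are
	// wrong). For a fixed k, the threads of a half-warp read Bs[0][k],
	// Bs[1][k], ..., i.e. words one row stride apart. With
	// Bs[BLOCK_SIZE][BLOCK_SIZE] the row stride is 16 words, so the accesses
	// pile up on one or two banks and get serialized; padding to
	// BLOCK_SIZE + 1 makes the stride 17 and spreads them across the banks
	// again.
	for (int k = 0; k < BLOCK_SIZE; k++)
		value += As[ty][k] * Bs[tx][k];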