CUDA Matrix Multiplication Performance

I have developed two programs for matrix multiplication using shared memory. I have used the Tile Size 16 x 16 in both the programs. In one program I have made static allocation for shared memory and in the other program I have made dynamic allocation for the shared memory. I have passed the SIZE of shared memory as kernel argument like matmul<<grid,thread,Size>>(…)

Issue is that, when I am comparing the performance of these two programs, I observe that I am getting better speed up where I have made static allocation for shared memory. The program where I have made dynamic allocation for shared memory does not give that much performance i.e. speed up obtained is much less than the former one.

I can not understand… why?? Can anyone please help me regarding this?

I believe your conclusion is flawed, possibly because you have made a mistake of some sort.

I don’t witness any such discrepancy.

Here is a fully worked test case, comparing both usages (dynamically allocated shared memory and statically allocated shared memory) based on the code presented in the programming guide:

$ cat
#include <stdio.h>
// DSIZE should be evenly divisible by BLOCK_SIZE
#define DSIZE 2048

#define AROW DSIZE
#define ACOL DSIZE
#define BCOL DSIZE

#define TEST_VAL 1.0f

// Thread block size
#define BLOCK_SIZE 16

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;

typedef float sft[BLOCK_SIZE];

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
    return A.elements[row * A.stride + col];

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col,
                           float value)
    A.elements[row * A.stride + col] = value;

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
 __device__ Matrix GetSubMatrix(Matrix A, int row, int col)
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                         + BLOCK_SIZE * col];
    return Asub;

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
#ifndef SDYNAMIC
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    MatMulKernel<<<dimGrid, dimBlock,sizeof(float)*BLOCK_SIZE*BLOCK_SIZE*2>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,

    // Free device memory

// Matrix multiplication kernel called by MatMul()
 __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
#ifndef SDYNAMIC
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        extern __shared__ float smem[];
        sft *As = (sft *)(smem);
        sft *Bs = (sft *)(smem+BLOCK_SIZE*BLOCK_SIZE);

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);

int main(){

  Matrix A,B,C;

  A.height = AROW;
  A.width  = ACOL;
  B.height = ACOL;
  B.width  = BCOL;
  C.height = AROW;
  C.width  = BCOL;

  A.elements = new float[A.height*A.width]();
  B.elements = new float[B.height*C.width]();
  C.elements = new float[C.height*C.width]();

  for (int i = 0; i < A.height*A.width; i++) A.elements[i] = TEST_VAL;
  for (int i = 0; i < B.height*B.width; i++) B.elements[i] = TEST_VAL;
  unsigned long dt = dtime_usec(0);
  dt = dtime_usec(dt);

  for (int i = 0; i < C.height*C.width; i++) if (C.elements[i] != (A.width*TEST_VAL*TEST_VAL)) {printf("mismatch at %d, was: %f, should be: %f\n", i, C.elements[i], (float)(A.width*TEST_VAL*TEST_VAL)); return 1;}
  printf("elapsed time: %f\n", dt/(float)USECPSEC);
  return 0;
$ nvcc -O3 -o t995
$ ./t995
elapsed time: 0.810462
$ nvcc -O3 -o t995 -DSDYNAMIC
$ ./t995
elapsed time: 0.809575

I just run you code… It will give me the following output…

For DSIZE 1024

arnab@arnab-Inspiron-3442:~/Downloads$ ./a.out
elapsed time: 0.112122
arnab@arnab-Inspiron-3442:~/Downloads$ ./a.out -DSDYNAMIC
elapsed time: 0.114594

For DSIZE 512

arnab@arnab-Inspiron-3442:~/Downloads$ ./a.out
elapsed time: 0.053405
arnab@arnab-Inspiron-3442:~/Downloads$ ./a.out -DSDYNAMIC
elapsed time: 0.054052

For DSIZE 256

arnab@arnab-Inspiron-3442:~/Downloads$ ./a.out
elapsed time: 0.038804
arnab@arnab-Inspiron-3442:~/Downloads$ ./a.out -DSDYNAMIC
elapsed time: 0.047208

Dynamic allocation takes more time… I have GeForce820M. And the main thing is that your code will not run for non-square matrix… I have given AROW 100, ACOL 120 and BCOL 200… It gives me following result…

arnab@arnab-Inspiron-3442:~/Downloads$ ./a.out
mismatch at 0, was: 112.000000, should be: 120.000000

My code will run for both square as well as non-square matrices…

./a.out -DSDYNAMIC is not how you run the dynamic shared version. You have to recompile.

And yes, the code only works with square matrices. It was not intended to be a perfect matrix multiplication project, but merely to address the issue of static vs. dynamic allocation, since you had provided no code or example whatsoever to support your assertion. There isn’t any significant difference in performance, and I believe the example I’ve offered demonstrates that assertion.

Differences of ~1% in runtime are in the noise. There is more than that much variability run-to-run, so we can’t make any judgements about a difference of ~1% in timing.