What would be the relationship between K, block_size and grid_size, and N?

I have to solve the following problem:

Implement a kernel for matrix products as outer products of vectors. In this version, each block of K threads computes a square piece of the result matrix of size KxK by implementing the matrix outer-product formula. The kernel uses shared memory to store the corresponding column vectors from matrix A and matrix B, and to store the corresponding fragment of the result array.

I wrote the following kernel:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <iostream>

#define real float
#define IDX(i, j, n) ((i) * (n) + (j))
#define K 32

const int SIZE_1 = 32;
const int SIZE_2 = 1;
const int N = 3;// size of matrices

// The kernel is working correctly.
__global__ void MultiplyGPUv2(real *A, real *B, real *C, int n) {
    __shared__ real sA[K][K];
    __shared__ real sB[K][K];

    int row = blockIdx.y * K + threadIdx.y;
    int col = blockIdx.x * K + threadIdx.x;
    real sum = 0.0;

    for (int r = 0; r < (n + K - 1) / K; r++) {
        // Load one KxK tile of A and B into shared memory,
        // zero-padding out-of-range elements.
        if (row < n && (r * K + threadIdx.x) < n) {
            sA[threadIdx.y][threadIdx.x] = A[IDX(row, r * K + threadIdx.x, n)];
        } else {
            sA[threadIdx.y][threadIdx.x] = 0.0;
        }

        if ((r * K + threadIdx.y) < n && col < n) {
            sB[threadIdx.y][threadIdx.x] = B[IDX(r * K + threadIdx.y, col, n)];
        } else {
            sB[threadIdx.y][threadIdx.x] = 0.0;
        }
        __syncthreads();

        for (int i = 0; i < K; i++) {
            sum += sA[threadIdx.y][i] * sB[i][threadIdx.x];
        }
        __syncthreads();
    }

    if (row < n && col < n) {
        C[IDX(row, col, n)] = sum;
    }
}
// Helper to print an n x n row-major matrix.
void printMatrix(const real *M, int n) {
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) printf("%6.0f ", M[IDX(i, j, n)]);
    printf("\n");
  }
}

int main() {
  real *A, *B, *C;  // Host matrices
  real *d_A, *d_B, *d_C;  // Device matrices

  // Allocate memory for host matrices A, B, and C
  A = (real *)malloc(N * N * sizeof(real));
  B = (real *)malloc(N * N * sizeof(real));
  C = (real *)malloc(N * N * sizeof(real));

  // Initialize host matrices A, B, and C
  for (int i = 0; i < N * N; i++) {
    A[i] = i + 1;
    B[i] = i + 1;
    C[i] = 0;
  }

  // Allocate memory for device matrices A, B, and C
  cudaMalloc((void **)&d_A, N * N * sizeof(real));
  cudaMalloc((void **)&d_B, N * N * sizeof(real));
  cudaMalloc((void **)&d_C, N * N * sizeof(real));

  // Copy data from host to device
  cudaMemcpy(d_A, A, N * N * sizeof(real), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, B, N * N * sizeof(real), cudaMemcpyHostToDevice);

  // Set the grid and block dimensions
  dim3 block_size(K, K, 1);
  dim3 grid_size((N + K - 1) / K, (N + K - 1) / K);
  //dim3 block_size(SIZE_1, SIZE_2, 1);
  //dim3 grid_size((N + SIZE_1 - 1) / SIZE_1, (N + SIZE_2 - 1) / SIZE_2);

  // Perform matrix multiplication on device
  MultiplyGPUv2<<<grid_size, block_size>>>(d_A, d_B, d_C, N);

  // Copy results from device to host
  cudaMemcpy(C, d_C, N * N * sizeof(real), cudaMemcpyDeviceToHost);

  if(N==3) printMatrix(C, N);

  // Free memory
  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
  free(A); free(B); free(C);

  return 0;
}
I have a confusion here.

I am told to vary SIZE_1xSIZE_2 = 32x1, 64x1, 128x1, … and record the execution time of the kernel. The problem also mentions another value, K. If I use block_size = 32x1 and K = 32, the listing above doesn't produce the correct result for the multiplication.

Is there a formula for the relationship between K and SIZE_1xSIZE_2? And how do I accommodate N in this relationship?

K is the side dimension of the square tile
N is the side dimension of the square matrices in the formula C = A*B, where A, B, and C are square matrices of dimension NxN, each logically subdivided into KxK tiles.

SIZE_1 is evidently the x dimension of the threadblock, and SIZE_2 is evidently the y dimension of the threadblock.

Ordinarily, for a basic treatment of tiled shared-memory matrix multiplication, K would be chosen to be a number like 32 or 16, and SIZE_1 and SIZE_2 would both be chosen to match K exactly.

It is possible to decouple the choice of SIZE_1 and SIZE_2 (the threadblock dimensions) from K (the logical tile dimension). A general method to do that is a 2-dimensional block-stride loop, which is a variant of a grid-stride loop. This requires modifying the kernel code: instead of each thread handling exactly one element of the KxK tile, each thread strides through the tile in steps of blockDim.x and blockDim.y, so a SIZE_1xSIZE_2 block of any shape can cover the whole tile. Thereafter, your grid sizing and incorporation of N would be as you have shown in the commented-out dim3 statements.
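As a rough illustration (not your assignment's reference solution), here is a minimal sketch of such a block-stride kernel. It assumes the grid is still sized as ceil(N/K) x ceil(N/K), since each block still owns one KxK tile of C, while the block itself may be any SIZE_1xSIZE_2 shape (e.g. 32x1). The name MultiplyGPUv3 and the shared result fragment sC are my own additions for illustration.

```cuda
// Sketch: tiled matrix multiply where the threadblock shape
// (blockDim.x x blockDim.y) is decoupled from the tile size K
// via 2D block-stride loops. Hypothetical name MultiplyGPUv3.
__global__ void MultiplyGPUv3(const real *A, const real *B, real *C, int n) {
    __shared__ real sA[K][K];
    __shared__ real sB[K][K];
    __shared__ real sC[K][K];   // result fragment kept in shared memory

    // Zero the result fragment; each thread strides over the tile.
    for (int i = threadIdx.y; i < K; i += blockDim.y)
        for (int j = threadIdx.x; j < K; j += blockDim.x)
            sC[i][j] = 0.0;
    __syncthreads();

    for (int r = 0; r < (n + K - 1) / K; r++) {
        // Load the KxK tiles of A and B with block-stride loops,
        // zero-padding out-of-range elements.
        for (int i = threadIdx.y; i < K; i += blockDim.y)
            for (int j = threadIdx.x; j < K; j += blockDim.x) {
                int aRow = blockIdx.y * K + i, aCol = r * K + j;
                int bRow = r * K + i,          bCol = blockIdx.x * K + j;
                sA[i][j] = (aRow < n && aCol < n) ? A[IDX(aRow, aCol, n)] : 0.0;
                sB[i][j] = (bRow < n && bCol < n) ? B[IDX(bRow, bCol, n)] : 0.0;
            }
        __syncthreads();

        // Accumulate into the fragment; each (i,j) element is always
        // handled by the same thread, so there is no write race.
        for (int i = threadIdx.y; i < K; i += blockDim.y)
            for (int j = threadIdx.x; j < K; j += blockDim.x)
                for (int k = 0; k < K; k++)
                    sC[i][j] += sA[i][k] * sB[k][j];
        __syncthreads();
    }

    // Write the fragment back to global memory.
    for (int i = threadIdx.y; i < K; i += blockDim.y)
        for (int j = threadIdx.x; j < K; j += blockDim.x) {
            int row = blockIdx.y * K + i, col = blockIdx.x * K + j;
            if (row < n && col < n) C[IDX(row, col, n)] = sC[i][j];
        }
}
```

With this structure you would launch with, e.g., `dim3 block_size(SIZE_1, SIZE_2)` and `dim3 grid_size((N + K - 1) / K, (N + K - 1) / K)`, and vary SIZE_1xSIZE_2 independently of K to record timings.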
