Need help understanding kernel function, grid and block

Hello, I’m new to CUDA programming and recently wrote a toy version of matrix multiplication. I’m having trouble making it work for sizes greater than 1024.

Full code: Pastebin.com

Sample (the printf is there to make things obvious):

__global__ void matmul(int **a, int **b, int **c, int n) {
    printf("%d\t%d\n", blockIdx.x, threadIdx.x);
    int i = blockIdx.x, j = threadIdx.x;
    c[i][j] = 0;
    for (int k = 0; k < n; k++) {
        c[i][j] += a[i][k] * b[k][j];
    }
}

and in main I call matmul<<<n, n>>>(a, b, c, n), where a, b, and c are 2D arrays and n is the size.

It all works fine for n <= 1024: the kernel runs and prints a lot of output. As soon as n reaches 1025, nothing is printed and the kernel does not run.

I’m running this with CUDA 11.2 on Ubuntu 18.04.

I’d appreciate any help! Thanks

The basic form of a kernel launch is kernel_name<<<number of blocks, number of threads per block>>>, so in total you launch (number of blocks) * (number of threads per block) threads. The maximum number of threads per block is 1024 on most GPUs, so this is most likely your problem: with n = 1025 the second launch argument exceeds that limit and the kernel never starts. I recommend setting the number of threads per block to 256, which usually works well, and computing the number of blocks from your array size and the number of threads per block.
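A minimal sketch of that suggestion, under a few assumptions that are not in the original post: the matrices are stored as flat row-major arrays in managed memory rather than as int**, and the names ceil_div and matmul_flat are hypothetical. It also checks cudaGetLastError() after the launch, which is how an oversized launch such as <<<1025, 1025>>> shows up as an "invalid configuration argument" error instead of failing silently.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Ceiling division: how many blocks of `threads` are needed to cover n items.
static int ceil_div(int n, int threads) { return (n + threads - 1) / threads; }

// One thread per output element. Flat row-major storage is an assumption of
// this sketch; the original post uses int** arrays.
__global__ void matmul_flat(const int *a, const int *b, int *c, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int i = blockIdx.y;                             // row index
    if (i >= n || j >= n) return;  // the last block in a row may overshoot n
    int sum = 0;
    for (int k = 0; k < n; k++) {
        sum += a[i * n + k] * b[k * n + j];
    }
    c[i * n + j] = sum;
}

int main() {
    const int n = 1025;       // a size that fails with <<<n, n>>>
    const int threads = 256;  // threads per block, well under the 1024 limit
    const size_t bytes = (size_t)n * n * sizeof(int);

    int *a, *b, *c;
    if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
        cudaMallocManaged(&b, bytes) != cudaSuccess ||
        cudaMallocManaged(&c, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }
    for (size_t idx = 0; idx < (size_t)n * n; idx++) { a[idx] = 1; b[idx] = 1; }

    // grid.x covers the columns in chunks of `threads`; grid.y is one block per row.
    dim3 grid(ceil_div(n, threads), n);
    matmul_flat<<<grid, threads>>>(a, b, c, n);

    // A bad launch configuration is reported here, not by the launch itself.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }
    // Errors raised while the kernel runs are reported at synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // With all-ones inputs, every entry of c should equal n.
    printf("c[0] = %d, c[last] = %d\n", c[0], c[(size_t)n * n - 1]);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return EXIT_SUCCESS;
}
```

The bounds check in the kernel matters because the number of blocks is rounded up, so some threads in the last block of each row fall outside the matrix.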


Thank you. A follow-up question: is this documented anywhere that I’m missing?

When you say the max number of threads per block is 1024, does it matter whether the threads are arranged in 1D, 2D or 3D? For a dim3, is 1024 the limit on the total product or a per-axis maximum?

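Yes, 1024 is the limit on the total product of the three axes. Each axis also has its own cap (typically 1024 for x and y, and 64 for z), but beyond that the dimensions only matter for making the code easier to write or read. The number of blocks can likewise be given in 1D, 2D or 3D, although there the limits apply to each axis separately rather than to the product.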
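On the documentation question: these limits are listed per compute capability in the technical specifications table of the CUDA C++ Programming Guide, and they can also be read at runtime. The sketch below queries them with cudaGetDeviceProperties for device 0 (assuming that is the GPU of interest) and tests two block shapes against them. The helper block_fits is a hypothetical name introduced only for this illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Host-side check: a block shape is valid only if every axis is within its
// per-axis cap AND the product of the three axes is within the total limit.
static bool block_fits(dim3 block, int maxThreads, const int maxDim[3]) {
    unsigned long long total =
        (unsigned long long)block.x * block.y * block.z;
    return block.x <= (unsigned)maxDim[0] &&
           block.y <= (unsigned)maxDim[1] &&
           block.z <= (unsigned)maxDim[2] &&
           total <= (unsigned long long)maxThreads;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim:      %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxGridSize:        %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    dim3 square(32, 32, 1);  // 32 * 32 * 1 = 1024 threads
    dim3 wide(64, 32, 1);    // 64 * 32 * 1 = 2048 threads
    printf("32x32x1 fits: %s\n",
           block_fits(square, prop.maxThreadsPerBlock, prop.maxThreadsDim) ? "yes" : "no");
    printf("64x32x1 fits: %s\n",
           block_fits(wide, prop.maxThreadsPerBlock, prop.maxThreadsDim) ? "yes" : "no");
    return EXIT_SUCCESS;
}
```

The deviceQuery program that ships with the CUDA samples prints the same fields without writing any code.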
