Need help understanding kernel function, grid and block

Hello, I’m new to CUDA programming and recently wrote a toy version of matrix multiplication. I’m having trouble making it work for sizes greater than 1024.

Full code: posted on Pastebin.com

Sample (the printf is there to make things obvious):

__global__ void matmul(int **a, int **b, int **c, int n) {
    printf("%d\t%d\n", blockIdx.x, threadIdx.x);
    int i = blockIdx.x, j = threadIdx.x;
    c[i][j] = 0;
    for (int k = 0; k < n; k++) {
        c[i][j] += a[i][k] * b[k][j];
    }
}

In main I call matmul<<<n, n>>>(a, b, c, n), where a, b, and c are 2D arrays and n is the matrix size.

It all works fine for n <= 1024: the kernel runs and prints a lot of output. As soon as n reaches 1025, nothing is printed and the kernel does not run.

I’m running this with CUDA 11.2 on Ubuntu 18.04.

I’d appreciate any help! Thanks

The basic form of a kernel launch is kernel_name<<<number of blocks, number of threads per block>>>, so the total number of threads launched is (number of blocks) * (number of threads per block). The maximum number of threads per block is 1024 on current GPUs, so this is most likely your problem: with n = 1025 you are asking for 1025 threads per block. I recommend setting the number of threads per block to 256, which usually works well, and computing the number of blocks from the size of your array and the number of threads per block.
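To make that advice concrete, here is a minimal sketch of one way to restructure the launch. It is not the original poster's code: it assumes flat row-major arrays allocated with cudaMallocManaged instead of the int ** layout in the question, and matmul_flat and ceil_div are names made up for this example. It also checks cudaGetLastError() after the launch, which is where an invalid configuration such as <<<1025, 1025>>> gets reported instead of failing silently.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: ceiling division, so the grid covers every element.
__host__ __device__ inline int ceil_div(int total, int per_block) {
    return (total + per_block - 1) / per_block;
}

// One thread per output element. Matrices are flat row-major arrays
// (an assumption; the question uses int ** instead).
__global__ void matmul_flat(const int *a, const int *b, int *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n * n) return;  // the last block may be only partly used
    int i = idx / n, j = idx % n;
    int sum = 0;
    for (int k = 0; k < n; k++) sum += a[i * n + k] * b[k * n + j];
    c[idx] = sum;
}

int main() {
    const int n = 1025;  // a size that cannot be launched as <<<n, n>>>
    const size_t bytes = (size_t)n * n * sizeof(int);
    int *a, *b, *c;
    if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
        cudaMallocManaged(&b, bytes) != cudaSuccess ||
        cudaMallocManaged(&c, bytes) != cudaSuccess) {
        printf("allocation failed\n");
        return 1;
    }
    for (int i = 0; i < n * n; i++) { a[i] = 1; b[i] = 1; }

    const int threads = 256;                      // threads per block
    const int blocks = ceil_div(n * n, threads);  // enough blocks for n*n elements
    matmul_flat<<<blocks, threads>>>(a, b, c, n);

    cudaError_t err = cudaGetLastError();         // catches bad launch configurations
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // With all-ones inputs, every entry of c should equal n.
    printf("c[0] = %d, n = %d\n", c[0], n);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The bounds check inside the kernel matters because the number of launched threads is rounded up to a multiple of 256, so a few threads in the last block have no element to compute.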


Thank you. A follow-up question: is this documented anywhere that I’m missing?

When you say the maximum number of threads per block is 1024, does it matter whether the threads are arranged in 1D, 2D, or 3D? For a dim3, is 1024 the limit on the total product or a per-axis maximum?

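Yes, 1024 is the limit on the product of the three axes (each axis also has its own cap, which is lower for z). Dimensionality only matters for making the code easier to write or read. The same is true of the grid of blocks: its dimensionality is a convenience, although the grid has its own, much larger, per-axis limits.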
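If you want to see the product rule for yourself, a small experiment along these lines should do it. This is only a sketch: noop, threads_in, and try_block are hypothetical names, and the shapes are picked for illustration. Each shape is launched as a single block, and the launch status is printed so you can compare shapes with the same thread count against shapes that exceed a limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

// Hypothetical helper: total threads in a block of the given shape.
static unsigned int threads_in(dim3 block) {
    return block.x * block.y * block.z;
}

// Hypothetical helper: launch one block of this shape and return the launch status.
static cudaError_t try_block(dim3 block) {
    noop<<<1, block>>>();
    cudaError_t err = cudaGetLastError();  // reports (and clears) configuration errors
    cudaDeviceSynchronize();
    return err;
}

int main() {
    const dim3 shapes[] = {
        dim3(1024, 1, 1),  // 1D, 1024 threads
        dim3(32, 32, 1),   // 2D, 1024 threads
        dim3(16, 16, 4),   // 3D, 1024 threads
        dim3(33, 32, 1),   // 2D, 1056 threads: over the total limit
        dim3(1, 1, 65),    // only 65 threads, but over the z-axis limit
    };
    for (const dim3 &s : shapes) {
        cudaError_t err = try_block(s);
        printf("(%u, %u, %u) = %u threads: %s\n",
               s.x, s.y, s.z, threads_in(s), cudaGetErrorString(err));
    }
    return 0;
}
```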



The CUDA Programming Guide section on Thread Hierarchy states:

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.

The limits are listed in the CUDA Programming Guide chapter on Compute Capabilities.

In the table you will find the rows such as the following:

Technical Specification                     Value for CC x.y
Maximum dimensionality of a thread block    3
Maximum x- or y-dimension of a block        1024
Maximum z-dimension of a block              64
Maximum number of threads per block         1024
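You can also read these limits from your own device at run time instead of looking them up. The sketch below uses cudaGetDeviceProperties on device 0 (an assumption, adjust if you have several GPUs). fits_in_block is a hypothetical helper written for this example, not part of the CUDA API. It checks a block shape against both the per-axis limits and the total thread limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: does this block shape respect the device's
// per-axis limits and its total threads-per-block limit?
static bool fits_in_block(const cudaDeviceProp &p, dim3 b) {
    unsigned long long total = (unsigned long long)b.x * b.y * b.z;
    return b.x <= (unsigned int)p.maxThreadsDim[0] &&
           b.y <= (unsigned int)p.maxThreadsDim[1] &&
           b.z <= (unsigned int)p.maxThreadsDim[2] &&
           total <= (unsigned long long)p.maxThreadsPerBlock;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 is an assumption
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    printf("block (1025, 1, 1) fits: %s\n",
           fits_in_block(prop, dim3(1025, 1, 1)) ? "yes" : "no");
    printf("block (256, 1, 1) fits:  %s\n",
           fits_in_block(prop, dim3(256, 1, 1)) ? "yes" : "no");
    return 0;
}
```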