Hello, I’m new to CUDA programming and recently wrote a toy version of matrix multiplication. I’m having an issue making it work for sizes greater than 1024.
Sample (the printf is there to make things obvious):
__global__ void matmul(int **a, int **b, int **c, int n) {
    printf("%d\t%d\n", blockIdx.x, threadIdx.x);
    int i = blockIdx.x, j = threadIdx.x;
    c[i][j] = 0;
    for (int k = 0; k < n; k++) {
        c[i][j] += a[i][k] * b[k][j];
    }
}
and in main I call matmul<<<n, n>>>(a, b, c, n), where a, b, and c are 2D arrays and n is the size.
It all works fine for n <= 1024: the kernel runs and there is a lot of printing. As soon as n reaches 1025, there is no printing and the kernel does not run.
The basic structure of a kernel call is kernel_name<<<number of blocks, number of threads per block>>>, so in total you are launching number of blocks * number of threads per block threads. However, the maximum number of threads per block is 1024 on most devices, so this is likely your problem: with n = 1025 you are asking for 1025 threads in a single block, and the launch fails. I recommend setting the number of threads per block to 256, which usually works fine, and computing the number of blocks as a function of your array size and the number of threads per block.
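As a rough sketch of that suggestion, keeping your int** layout (the variable names here are just illustrative, not a definitive implementation):

```
__global__ void matmul(int **a, int **b, int **c, int n) {
    // Flatten the global thread index into a (row, col) pair,
    // and guard against the extra threads in the last block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n * n) return;
    int i = idx / n, j = idx % n;
    int sum = 0;
    for (int k = 0; k < n; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum;
}

// In main: 256 threads per block, enough blocks to cover n*n elements.
int threadsPerBlock = 256;
int numBlocks = (n * n + threadsPerBlock - 1) / threadsPerBlock;
matmul<<<numBlocks, threadsPerBlock>>>(a, b, c, n);
```

Also, checking cudaGetLastError() right after the launch would have reported an error (rather than silently printing nothing) when the configuration exceeds the limit.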
Thank you. A follow-up question: is this documented anywhere that I’m missing?
When you say the max number of threads per block is 1024, does it matter whether the threads are in 1D, 2D, or 3D? For a dim3, is 1024 the total product or a per-axis maximum?
Yes, 1024 is the limit on the product of the three axes (there are also per-axis limits: on current GPUs a block’s x and y dimensions can each be at most 1024, and z at most 64). Within those limits, the choice of dimensions only matters for making the code easier to write and read. The same idea applies to the number of blocks, though the grid has its own, much larger limits.
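For example, assuming some kernel and grid size, these block shapes all stay within the limit because the per-block product is at most 1024:

```
dim3 threads2d(32, 32);     // 32 * 32 = 1024 threads per block: OK
dim3 threads3d(16, 8, 8);   // 16 * 8 * 8 = 1024 threads per block: OK
// dim3 tooMany(64, 32);    // 64 * 32 = 2048: the launch would fail
kernel<<<numBlocks, threads2d>>>(/* args */);
```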
In the CUDA Programming Guide, the section on Thread Hierarchy states:
There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
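If you’d rather check the limits on your own device than rely on the documented values, the runtime API lets you query them (cudaGetDeviceProperties and the cudaDeviceProp fields below are the real API; the program itself is just a minimal sketch):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxGridSize: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```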