Hello, I’m new to CUDA programming and recently wrote a toy version of matrix multiplication. I’m having an issue making it work for sizes greater than 1024.
Sample (the printf is there to make things obvious):
__global__ void matmul(int **a, int **b, int **c, int n) {
    printf("%d\t%d\n", blockIdx.x, threadIdx.x);
    int i = blockIdx.x, j = threadIdx.x;
    c[i][j] = 0;
    for (int k = 0; k < n; k++) {
        c[i][j] += a[i][k] * b[k][j];
    }
}
and in main I call matmul<<<n, n>>>(a, b, c, n), where a, b, and c are 2D arrays and n is the size.
It all works fine for n <= 1024: the kernel runs and there is a lot of printing. As soon as n reaches 1025, there is no printing and the kernel does not run.
The basic structure of a kernel call is kernel_name<<<number of blocks, number of threads per block>>>, so in total you are launching number of blocks * number of threads per block threads. However, the maximum number of threads per block is 1024 on most devices, so this is likely your problem: with n = 1025 you are asking for 1025 threads in a single block, and the launch fails. I recommend setting the number of threads per block to 256, which usually works fine, and computing the number of blocks as a function of your array size and the number of threads per block.
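As a rough sketch of that suggestion, keeping your int** layout (the variable names here are just illustrative, not a definitive implementation):

```
__global__ void matmul(int **a, int **b, int **c, int n) {
    // Flatten the global thread index into a (row, col) pair,
    // and guard against the extra threads in the last block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n * n) return;
    int i = idx / n, j = idx % n;
    int sum = 0;
    for (int k = 0; k < n; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum;
}

// In main: 256 threads per block, enough blocks to cover n*n elements.
int threadsPerBlock = 256;
int numBlocks = (n * n + threadsPerBlock - 1) / threadsPerBlock;
matmul<<<numBlocks, threadsPerBlock>>>(a, b, c, n);
```

Also, checking cudaGetLastError() right after the launch would have reported an error (rather than silently printing nothing) when the configuration exceeds the limit.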
Thank you. A follow-up question: is this documented anywhere that I’m missing?
When you say the max number of threads per block is 1024, does it matter whether the threads are in 1D, 2D, or 3D? For a dim3, is 1024 the total product or a per-axis maximum?
Yes, 1024 is the limit on the product of the three axes (there are also per-axis limits: on current GPUs a block’s x and y dimensions can each be at most 1024, and z at most 64). Within those limits, the choice of dimensions only matters for making the code easier to write and read. The same idea applies to the number of blocks, though the grid has its own, much larger limits.
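For example, assuming some kernel and grid size, these block shapes all stay within the limit because the per-block product is at most 1024:

```
dim3 threads2d(32, 32);     // 32 * 32 = 1024 threads per block: OK
dim3 threads3d(16, 8, 8);   // 16 * 8 * 8 = 1024 threads per block: OK
// dim3 tooMany(64, 32);    // 64 * 32 = 2048: the launch would fail
kernel<<<numBlocks, threads2d>>>(/* args */);
```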
In the CUDA Programming Guide, the section on Thread Hierarchy states:
There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
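If you’d rather check the limits on your own device than rely on the documented values, the runtime API lets you query them (cudaGetDeviceProperties and the cudaDeviceProp fields below are the real API; the program itself is just a minimal sketch):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxGridSize: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```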