CUDA kernel not executing all threads

__global__ void laplacian(double *u, double *out, int Nx, int Ny, double h) {

    int gindex_x = threadIdx.x + blockIdx.x * blockDim.x;
    int gindex_y = threadIdx.y + blockIdx.y * blockDim.y;
    int lindex_x = threadIdx.x;
    int lindex_y = threadIdx.y;
    double h12h2 = 12.0*h*h;

    //printf("blockIdx.x: %d\n",blockIdx.x);
    //printf("blockDim.x: %d\n",blockDim.x);
    if(gindex_x + Nx*gindex_y > 1048570){
        printf("gindex: %d\n",gindex_x + Nx*gindex_y);
    }

}

#include "lclv.h"
extern "C" {

const int BLOCKSIZEX = 32; // number of threads per block along dim 1
const int BLOCKSIZEY = 32; // number of threads per block along dim 2

void exec_kernel(double *u_d, double *out, int Nx, int Ny, double h){

	// compute (2D) execution configuration
	int nBlocksX = Nx / BLOCKSIZEX; // how many blocks along dim 1
	int nBlocksY = Ny / BLOCKSIZEY; // how many blocks along dim 2
	dim3 dimBlock(BLOCKSIZEX, BLOCKSIZEY); // set values
	dim3 dimGrid(nBlocksX, nBlocksY);

	// call lclv kernel
	laplacian <<< dimGrid, dimBlock >>> (u_d, out, Nx, Ny, h);
	cudaDeviceSynchronize();
}
}

I am new to CUDA, so I am trying to learn. The above code prints nothing at all. I would have expected it to print something, as there should be 1024x1024 threads. Without the if statement, the kernel only prints ~5000 statements, not the 1024x1024 I would expect. Why? There is other code that fills the arrays etc., which I have left out for clarity.

  • Anytime you post code here, please format it properly. When you are editing your question, paste your code, select the code, and then click the </> button in the toolbar at the top of the edit window. Then save your changes. I have done it for you this time.
  • Asterisk in text causes the asterisk to disappear and a switch to italics. I suggest using another character for multiplication, I have switched your asterisks to x
  • I usually suggest people provide a short, complete code. That means you should create the shortest version of your code that still demonstrates the issue, and make it complete so that others can copy, paste, compile, and run it without having to add or change anything. This has benefits: it increases clarity, reduces Q&A back-and-forth, and immediately allows others to use powerful tools to diagnose the problem and perhaps help you quickly. When you fail to do this, you create a barrier for others to help you, and you increase the amount of their valuable time they must spend to help you.
  • in-kernel printf prints to a limited buffer. That is the most likely reason why, without the if statement, the kernel only prints ~5000 statements. There are numerous web forum questions on this topic, or you can read the documentation (woo-hoo, just noticed CUDA 12.0 is out!)
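
If you really do want to see all 1024x1024 lines, the device-side printf FIFO can be enlarged with `cudaDeviceSetLimit` before any kernel work is done on the device. A minimal host-side sketch (the 16 MB figure is an arbitrary illustrative size, not a recommendation):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Enlarge the in-kernel printf FIFO; must be done before the
    // printing kernel runs. 16 MB is an arbitrary example size.
    size_t sz = 16 * 1024 * 1024;
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, sz);

    // Read the limit back to confirm what the runtime actually set.
    cudaDeviceGetLimit(&sz, cudaLimitPrintfFifoSize);
    printf("printf FIFO size: %zu bytes\n", sz);

    // ... launch the printing kernel here, then cudaDeviceSynchronize()
    // to flush the buffer to the host ...
    return 0;
}
```

Note that each printf record consumes more than just the formatted bytes, so the buffer fills faster than a naive byte count would suggest.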

When I run the following adaptation of the code you have posted, I get the expected print-out:

$ cat t2159.cu
#include <cstdio>
__global__ void laplacian(double *u, double *out, int Nx, int Ny, double h) {

    int gindex_x = threadIdx.x + blockIdx.x * blockDim.x;
    int gindex_y = threadIdx.y + blockIdx.y * blockDim.y;
    //int lindex_x = threadIdx.x;
    //int lindex_y = threadIdx.y;
    double h12h2 = 12.0*h*h;

    //printf("blockIdx.x: %d\n",blockIdx.x);
    //printf("blockDim.x: %d\n",blockDim.x);
    if(gindex_x + Nx*gindex_y > 1048570){
        printf("gindex: %d\n",gindex_x + Nx*gindex_y);
    }

}


const int BLOCKSIZEX = 32; // number of threads per block along dim 1
const int BLOCKSIZEY = 32; // number of threads per block along dim 2

void exec_kernel(double *u_d, double *out, int Nx, int Ny, double h){

        // compute (2D) execution configuration
        int nBlocksX = Nx / BLOCKSIZEX; // how many blocks along dim 1
        int nBlocksY = Ny / BLOCKSIZEY; // how many blocks along dim 2
        dim3 dimBlock(BLOCKSIZEX, BLOCKSIZEY); // set values
        dim3 dimGrid(nBlocksX, nBlocksY);

        // call lclv kernel
        laplacian <<< dimGrid, dimBlock >>> (u_d, out, Nx, Ny, h);
        cudaDeviceSynchronize();
}

int main(){

    double *u_d = NULL;
    double *out = NULL;
    int Nx = 1024;
    int Ny = 1024;
    double h = 0;
    exec_kernel(u_d, out, Nx, Ny, h);
}
$ nvcc -o t2159 t2159.cu
$ compute-sanitizer ./t2159
========= COMPUTE-SANITIZER
gindex: 1048571
gindex: 1048572
gindex: 1048573
gindex: 1048574
gindex: 1048575
========= ERROR SUMMARY: 0 errors
$

So I think the problem lies in something you haven’t shown.

Thanks for the help, but I have fixed my problem. I was going out of bounds in a shared-memory array. Unfortunately, you don't get a segfault when you go out of bounds like you do with a normal C array. The kernel apparently just quietly crashed without an error. I still have to get used to the fact that anything wrong in the kernel causes problems in the whole code.
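
The failure mode described here (a silent shared-memory overrun) is exactly what `compute-sanitizer` is for. A deliberately broken, hypothetical kernel (not the OP's actual code) that reproduces the symptom:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void oob_demo() {
    __shared__ double buf[32];
    // Deliberate bug: the block launches 64 threads, but buf has
    // only 32 slots, so threads 32..63 write out of bounds.
    buf[threadIdx.x] = 1.0;
}

int main() {
    oob_demo<<<1, 64>>>();
    cudaError_t err = cudaDeviceSynchronize();
    // Without a tool this may well report cudaSuccess; run under
    //   compute-sanitizer ./a.out
    // to get "Invalid __shared__ write" reports for the bad threads.
    printf("sync returned: %s\n", cudaGetErrorString(err));
    return 0;
}
```

Out-of-bounds shared-memory accesses may corrupt neighboring data, abort the kernel, or appear to succeed, which is why they are so easy to miss without the sanitizer.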

Since asterisks are part of the markup, the easiest way to write literal asterisks is to escape them by writing \*. E.g. One asterisk: *; two asterisks: **; many asterisks: *** … **.

I doubt that the kernel “just quietly crashed without error”. With CUDA, you need to actively check for errors. You can find information about it here: What is the canonical way to check for errors using the CUDA runtime API? - Stack Overflow

For example, cudaDeviceSynchronize() will return an error if one occurred up to that point, including errors from the previous kernel. You can look it up in the runtime API documentation. CUDA Runtime API :: CUDA Toolkit Documentation
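
A common pattern, close to the one in the linked canonical answer, is to wrap every runtime API call in a checking macro and to check kernel launches explicitly:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a file/line diagnostic if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage after a launch (kernel names here match the question's code):
//   laplacian<<<dimGrid, dimBlock>>>(u_d, out, Nx, Ny, h);
//   CUDA_CHECK(cudaGetLastError());        // launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // errors during kernel execution
```

The `cudaGetLastError()` call catches invalid launch configurations immediately, while the `cudaDeviceSynchronize()` check surfaces errors that occur while the kernel is actually running.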