Question about the time cost of a blank kernel

I have a kernel that does nothing.

When I launch it with different gridDim and blockDim values, I find that the latency differs a lot, from about 1 us to 100 us.
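
A minimal sketch of this kind of measurement (an empty kernel timed with CUDA events; the two launch configurations are arbitrary examples, not my exact test) would look something like this:

#include <stdio.h>

__global__ void empty_kernel (void) {}

// illustrative sketch: time one launch of an empty kernel with CUDA events
static void time_launch (dim3 grid, dim3 block)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate (&start);
    cudaEventCreate (&stop);
    cudaEventRecord (start);
    empty_kernel<<<grid, block>>>();
    cudaEventRecord (stop);
    cudaEventSynchronize (stop);   // wait for the kernel to finish
    cudaEventElapsedTime (&ms, start, stop);
    printf ("grid %u x block %u: %.1f usec\n", grid.x, block.x, ms * 1000.0f);
    cudaEventDestroy (start);
    cudaEventDestroy (stop);
}

int main (void)
{
    // warm-up launch so context creation does not skew the first measurement
    empty_kernel<<<1, 32>>>();
    cudaDeviceSynchronize();

    time_launch (dim3(1), dim3(32));          // tiny launch
    time_launch (dim3(100000), dim3(256));    // much larger launch of the same kernel
    return 0;
}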

What determines the time cost of a blank kernel?

Is it caused by kernel scheduling overhead?
If the number of threads is large, where does the time cost come from?
Thank you.

A kernel that “does nothing” still has a few SASS instructions that need to be executed by every launched thread. You can discover this with the cuobjdump tool.
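
For example, compiling a do-nothing kernel to a cubin and disassembling it makes those instructions visible. The file and kernel names here are just placeholders, and the exact listing depends on GPU architecture and CUDA version, so the SASS below only illustrates the general shape:

// null_kernel.cu
__global__ void null_kernel (void) {}

nvcc -arch=sm_75 -cubin -o null_kernel.cubin null_kernel.cu
cuobjdump -sass null_kernel.cubin

        MOV R1, c[0x0][0x28] ;
        EXIT ;
.L_x_0:
        BRA `(.L_x_0) ;
        NOP;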

The time cost for a thread to execute SASS instructions is not zero.

Therefore it is reasonable to assume, and borne out in measurement, that as you increase the number of threads, the kernel duration will tend to increase. This is more-or-less true for any CUDA kernel, empty or not.

As Robert Crovella said, a null kernel is not completely empty. If one looks at the disassembly, it appears to load a stack pointer, execute a few NOPs, then issue an EXIT instruction. Also, while the hardware schedulers in a GPU work with minimal overhead, that overhead is not zero. Below is sample output from my system using CUDA 12.3: I used one grid with 1K thread blocks and another grid with 1M thread blocks. I built on a Windows 10 platform with:

nvcc -o null_kernels.exe -arch=sm_61 -Xcompiler "/W4 /O2 /favor:INTEL64 /arch:AVX512 /fp:precise" null_kernels.cu

Device 0: "Quadro RTX 4000"; (sm_75); driver mode = WDDM
Making 3 passes, each comprising 100 launches
Grid of 1000 thread blocks, each comprising 256 threads
time per kernel: launch only         = 1.58 usec
time per kernel: launch & completion = 7.63 usec

Device 1: "Quadro P2000"; (sm_61); driver mode = TCC
Making 3 passes, each comprising 100 launches
Grid of 1000 thread blocks, each comprising 256 threads
time per kernel: launch only         = 1.67 usec
time per kernel: launch & completion = 5.32 usec

Device 0: "Quadro RTX 4000"; (sm_75); driver mode = WDDM
Making 3 passes, each comprising 100 launches
Grid of 1000000 thread blocks, each comprising 256 threads
time per kernel: launch only         = 1.63 usec
time per kernel: launch & completion = 976.26 usec

Device 1: "Quadro P2000"; (sm_61); driver mode = TCC
Making 3 passes, each comprising 100 launches
Grid of 1000000 thread blocks, each comprising 256 threads
time per kernel: launch only         = 1.84 usec
time per kernel: launch & completion = 2614.04 usec

From the second data set it is clear that execution time per thread block is on the order of a nanosecond. Considering that the GPUs here run at a clock frequency of between 1 and 2 GHz, that tells us that execution requires a few clock cycles only, as we would expect.
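
Spelled out with the launch & completion numbers for the 1M-block grids (the roughly 2 usec launch overhead is negligible at this scale):

Quadro RTX 4000:   976.26 usec / 1,000,000 blocks ≈ 0.98 ns per block
Quadro P2000:     2614.04 usec / 1,000,000 blocks ≈ 2.61 ns per block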

Here is the source code of my program for reference purposes (for brevity I omitted CUDA error checking, which is not recommended practice):

#include <stdio.h>
#include <stdlib.h>

__global__ void null_kernel (void) {}

#define BLOCKS_PER_GRID  (1000000)
#define THREADS_PER_BLK  (256)
#define REPS             (100) // number of consecutive launches per pass
#define MAX_PASS         (3)   // number of overall passes till steady state

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int main (void)
{
    double start_launch, stop_launch;
    double start_launch_and_complete, stop_launch_and_complete;
    dim3 gridDim  = {BLOCKS_PER_GRID};
    dim3 blockDim = {THREADS_PER_BLK};
    int device_count = 0;
    
    cudaGetDeviceCount (&device_count);
    for (int k = 0; k < device_count; k++) {
        cudaDeviceProp deviceProp;
        cudaSetDevice (k);
        cudaGetDeviceProperties (&deviceProp, k);
        printf("\nDevice %d: \"%s\"; (sm_%d%d); ", k, deviceProp.name, 
               deviceProp.major, deviceProp.minor);
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
        printf ("driver mode = %s\n",  deviceProp.tccDriver ? "TCC" : "WDDM");
#endif

        printf ("Making %d passes, each comprising %d launches\n", 
                MAX_PASS, REPS);
    
        for (int j = 0; j < MAX_PASS; j++) {
            cudaDeviceSynchronize();
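            // first timed region: enqueue REPS launches without waiting for them to finish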
            start_launch = second();
            for (int i = 0; i < REPS; i++) {
                null_kernel<<<gridDim,blockDim>>>();
            }
            stop_launch = second();
            cudaDeviceSynchronize();
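            // second timed region: enqueue REPS launches, then wait for all of them to complete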
            start_launch_and_complete = second();
            for (int i = 0; i < REPS; i++) {
                null_kernel<<<gridDim,blockDim>>>();
            }
            cudaDeviceSynchronize(); 
            stop_launch_and_complete = second();
            
            if (j == (MAX_PASS - 1)) {
                printf ("Grid of %d thread blocks, each comprising %d threads\n", 
                        gridDim.x, blockDim.x);
                printf ("time per kernel: launch only         = %.2f usec\n", 
                        ((stop_launch - start_launch) / REPS) * 1e6);
                printf ("time per kernel: launch & completion = %.2f usec\n", 
                        ((stop_launch_and_complete - start_launch_and_complete) / 
                         REPS) * 1e6);
            }
        }
    }
    return EXIT_SUCCESS;
}
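
Since the timer routine also has a Linux code path, an equivalent build and run on Linux would look something like this (sm_75 is just an example; adjust -arch to match your GPU):

nvcc -o null_kernels -arch=sm_75 -O2 null_kernels.cu
./null_kernels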

Thank you for the detailed explanation!