Minimum kernel invocation time

What is the minimum kernel invocation time people have seen? I’ve written a very simple program that launches a kernel that does nothing. Here is the output of the timing measurement:


c:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win32\Release>minKernelTime.exe

Processing time: 0.013412 (ms)

Press ENTER to exit…


Initially it was in the tens of milliseconds, but I found a post indicating that the way to speed this up is to allocate some memory on the device first. After doing that I see times around 13-14 us, which is still FAR too long. I’ve written an 8x8 DCT algorithm, and this invocation time dominates the performance of the kernel. Is there a way to reduce the time it takes to invoke a kernel?

Here is my code:

// includes, project
#include <stdio.h>
#include <cuda.h>
#include <cutil.h>

// Empty kernel: any time measured around its launch is pure invocation overhead.
__global__ void
testKernel()
{
}

int
main(int argc, char** argv)
{
    CUT_DEVICE_INIT();

    unsigned int num_threads = 32;

    // setup execution parameters
    dim3 grid(1, 1, 1);
    dim3 threads(num_threads, 1, 1);

    // dummy allocation to force CUDA context creation before timing
    float* tmp;
    cudaMalloc((void**)&tmp, 1);

    // execute the kernel; the launch is asynchronous, so the timer
    // measures the launch overhead rather than the kernel execution
    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    testKernel<<<grid, threads>>>();

    CUT_SAFE_CALL(cutStopTimer(timer));

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
    CUT_SAFE_CALL(cutDeleteTimer(timer));

    cudaFree(tmp);
    CUT_EXIT(argc, argv);
}

You should call testKernel repeatedly and divide the total time by the number of calls. The very first call of a kernel in a program takes longer than later calls due to initialization overhead: the kernel binary is sent to the card, and possibly transformed further by the CUDA driver.
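Something like the following minimal sketch, reusing the cutil timer calls from your code (NUM_LAUNCHES and the warm-up launch are my additions, and cudaThreadSynchronize() is used so all queued launches have finished before the timer is stopped):

// sketch: average the launch cost over many calls
const int NUM_LAUNCHES = 1000;   // arbitrary count for illustration

unsigned int timer = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer));

// warm-up launch so the one-time initialization is excluded from the average
testKernel<<<grid, threads>>>();
cudaThreadSynchronize();

CUT_SAFE_CALL(cutStartTimer(timer));
for (int i = 0; i < NUM_LAUNCHES; ++i)
    testKernel<<<grid, threads>>>();
cudaThreadSynchronize();   // wait until all queued launches have completed
CUT_SAFE_CALL(cutStopTimer(timer));

printf("Average time per launch: %f (ms)\n",
       cutGetTimerValue(timer) / NUM_LAUNCHES);
CUT_SAFE_CALL(cutDeleteTimer(timer));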

That is why they say the GPU is only faster for a large amount of computation.
With a small amount of computation, the overhead of invoking the kernel and allocating memory on the device is just way too big.

But the driver initialization only happens once, at program startup. Program startup is already slow because the program needs to be loaded from disk, shared libraries loaded, etc., so it is not really noticeable.

I can check again, but I believe the fastest kernel launch I’ve seen for an empty kernel is ~10 microseconds. The overhead depends linearly on the number of blocks you request in the grid, so you may see different amounts of overhead quoted in the forums depending on the grid size each test was performed with.
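To see that dependence yourself, you could time the empty kernel for a few grid sizes (a sketch of my own, not from any of the posts above; the block counts are arbitrary sample values):

// sketch: per-launch overhead as a function of grid size
const int NUM_LAUNCHES = 1000;
int blockCounts[] = { 1, 16, 256, 4096 };   // arbitrary sample grid sizes

for (int b = 0; b < 4; ++b)
{
    dim3 grid(blockCounts[b], 1, 1);
    dim3 threads(32, 1, 1);

    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));
    for (int i = 0; i < NUM_LAUNCHES; ++i)
        testKernel<<<grid, threads>>>();
    cudaThreadSynchronize();   // make sure all launches have drained
    CUT_SAFE_CALL(cutStopTimer(timer));

    printf("%d blocks: %f ms per launch\n",
           blockCounts[b], cutGetTimerValue(timer) / NUM_LAUNCHES);
    CUT_SAFE_CALL(cutDeleteTimer(timer));
}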

Are there any just-in-time code transformations applied by the driver to the cubin when it is first run? I’ve never been entirely clear whether the cubin is the native machine code format of the GPU, or yet another virtual machine target.

As far as I know, there are some just-in-time transformations. But I would assume that these are done on a per-cubin basis and not per-kernel. Some thorough testing might support one or the other, but measuring such short startup times is difficult to do correctly.

I think this is a question for Wumpus…