reduce overhead of launching a new thread block

I have been puzzled by the performance of my GPU code,
so I tried an experiment: in the existing code I inserted
a conditional return as the first line of the kernel, with a condition that is always true,
so the kernel now does no work. As expected, the time between kernel<<<blocks,32>>>
and the following gpuErrchk( cudaDeviceSynchronize() ); falls, but only to about 30%
of its original value.
I think this means about 30% of my elapsed time is disappearing in just
starting and stopping my kernel. On a GTX 745 with 2000 blocks this
averages about 26.5 microseconds.

Is this what you would expect?

How can it be reduced?

LongY reported about half a microsecond (420 ticks), but that was measured internally
on the GPU and so does not include cudaDeviceSynchronize() etc.

The kernel has 6 scalar arguments (five int, one float) and 7 pointer/array arguments
(int[31], int[8][5][5], char*, short* and three int*).

As always, any help or guidance would be most welcome.

                        Thank you


    Prof. W. B. Langdon
    Department of Computer Science
    University College London
    Gower Street, London WC1E 6BT, UK


Try using nvprof, it itemizes everything, including the kernel launch time, so you can see if your theory is correct. I believe the cudaLaunch call for me is somewhere around 8us.

Dear shaklee3,
Thank you for the suggestion and data for your system, especially your
cudaLaunch time.

I guess the first point is that nvprof 7.0.28 seems to add an overhead of about 2.1
microseconds per kernel launch.

My kernel is still embedded in the application and so nvprof is reporting averages
for multiple kernels. I will see if the data can be disentangled.

Thanks again

I have estimated what nvprof would give as Avg for my kernel as
4us cudaLaunch, but to this I add:
cudaSetupArgument 300ns * 13
cudaPeekAtLastError 300ns
cudaConfigureCall 300ns

giving 8500ns in total, from which I should probably remove the 2100ns of nvprof overhead,
ending up with 6.4 microseconds, close to your figure of “somewhere around 8us”.

But this is way short of the actual elapsed time of 26.5 microseconds
for 2000 blocks (and the elapsed time increases with the number of blocks).

I wonder if there is some overhead that nvprof is not counting?

Another ps: it turns out that the question of detailed timing has come up before.
However the recommendation was to use gettimeofday (with care), which is what
I hope I am doing.

So I am still struggling to see what is causing the launch overhead
or ways to reduce it.


What operating system are you on? Number one reason for high launch overhead is the use of Windows with a WDDM driver.

With WDDM (introduced with Windows 7), and even more so with WDDM 2.x under Windows 10, it is the Windows operating system that controls most aspects of GPU operation. This has benefits from Microsoft’s perspective, such as increased system stability compared to the previous Windows XP driver model. But it is often detrimental to performance as well as other aspects such as GPU memory allocation.

Average launch overhead of around 25 usec seems perfectly normal in a WDDM scenario. I am referring to the average because with WDDM the CUDA driver tries to batch launches in order to reduce the average launch overhead. Batching often causes the overhead of specific launches to fluctuate from close to the lower limit imposed by hardware (around 5 usec) to much higher values (e.g. 40 to 50 usec).

If you need / want low launch overhead, either use Windows with a TCC driver (not possible with a consumer GPU like GTX 745) or use Linux. If you then run on a host system with a fast CPU with high single-thread performance you should be able to get close to 5 usec when launching null kernels, i.e. kernels that do nothing and are not being passed any arguments. For realistic kernels with arguments, launch overhead should be expected to be around 7 to 8 usec.

The observation that use of the CUDA profiler adds about 2 usec per kernel launch seems very plausible given that the profiler needs to insert a hook into the launch mechanism in order to log data about launches.

Note that all overhead beyond a ~5 usec hardware baseline is caused by software, and that this is single-threaded work, meaning the additional software overhead is primarily influenced by single-core CPU performance and therefore core frequency, secondarily by host system memory performance. For that reason I recommend CPUs with a base core frequency >= 3.5 GHz, and system memory that uses the fastest speed grade the CPU supports and with as many DDR4 channels as possible.

An interesting question is whether kernel launch overhead was negatively impacted by the operating system workarounds for recently identified CPU vulnerabilities (Meltdown, Spectre). I don’t know the answer to that but it seems possible.

In general, one should partition work such that kernel run times are in the multi-millisecond range even for the fastest GPU, to avoid undue impact of launch overhead on application-level performance. I know that this is not always possible, and that there are real-life use cases that result in extremely short kernel run times (and run times are further shortened with each new generation of GPUs).

Note: The 5 usec lower limit to hardware-based launch overhead applies when a PCIe gen3 x16 interface is used. I could imagine that higher overhead may be observed for interfaces of reduced width (I think some people use interface sizes down to x1 in crypto-currency mining rigs) or when PCIe gen2 is used (the latter condition shouldn’t be encountered with any system built in the past five years).

Dear njuffa,
Thank you for your detailed and thoughtful reply.

I am using Centos 7
3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
The kernel is supposedly patched for recently identified CPU vulnerabilities but
I do not have any timings for before the kernel patch was added.
The host is an 8 core i7-4790 CPU @ 3.60GHz
The GPU is a 4GB GTX 745

Will the Linux driver also batch together kernel launches?
And thereby reduce the overhead?

In summary, it seems that even with Linux my time to launch a null kernel
(about 26.5 microseconds) is pretty much what you would expect with
current Microsoft Windows,
and you are not surprised by the discrepancy between the time measured
on the host and the times nvprof reports?

Once again many thanks for your help

To my knowledge, there is no launch batching in the CUDA driver for any platform other than Windows with the default WDDM driver. On all other platforms, launch overhead for null kernels should be close to the 5 usec lower limit imposed by the hardware interface, and this has been unchanged for years, across several CUDA versions. I am surprised by your observations, and at present I have no idea what could be going on.

A few caveats: I have yet to try CUDA 9.x and related drivers, and I haven’t installed any patches for the Meltdown & Spectre issues yet.

Is the GPU in x16 slot? If you run the bandwidthTest app that ships with CUDA, what host->device bandwidth does it show? Have you set an environment variable such as CUDA_LAUNCH_BLOCKING by any chance? I am not even sure that particular environment variable still exists; it used to be a debug feature. Is there any logging feature at OS level that may have been turned on and that would impact launch overhead?

I am aware of some reasonably widely used CUDA-accelerated applications that make use of short-running kernels, so if there were any general issue with increased launch overhead with recent software, I would expect there to be a number of complaints in these forums, but there are no others so far I have seen. So for now I assume the issue is local to your setup.

Have you considered a possible flaw in your measurement methodology?

Dear njuffa,
Thank you for bearing with me.

[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: GeForce GTX 745
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12354.0

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12002.7

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 24582.4

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

nvvp typically shows data rates to/from the device at about 8 GB/second
(on about 10 kbyte transfers).
Under System Memory it says PCIe configuration: Gen3 x16 8 Gbit/s
(yes, the nvvp PDF output does say Gbit, i.e. bits).

lspci gives

00:00.0 Host bridge: Intel Corporation 4th Gen Core Processor DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 04)
00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)
00:16.3 Serial controller: Intel Corporation 8 Series/C220 Series Chipset Family KT Controller (rev 04)
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM (rev 04)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d4)
00:1c.1 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #2 (rev d4)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation Q87 Express LPC Controller (rev 04)
00:1f.2 RAID bus controller: Intel Corporation SATA Controller [RAID mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 04)
01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 745] (rev a2)
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)
03:00.0 PCI bridge: Texas Instruments XIO2001 PCI Express-to-PCI Bridge

Nothing unexpected shows up among the CUDA environment variables.

I am unaware of any unusual operating system logging

I will double check my timings, etc.


The bandwidth test shows that the GPU is definitely sitting in a PCIe gen 3 x16 slot, as it could not get to a 12+ GB/sec transfer rate otherwise. Based on the totality of the information provided so far, I have no plausible working hypothesis at this time.

Dear njuffa,
Hmm, I am not sure if this will help, but I have also tried
the same CUDA source code on a K20c (compute capability 3.5) and get something similar.

I should have said that in both cases the time to launch the null kernel
varies linearly with the number of blocks; the figures I gave you refer
to launching 2000 blocks of 32 threads.

Linear regression gives:
GTX 745:   9263.2 + 8.94564 * nblocks nanoseconds
Tesla K20: 13344.5 + 2.85562 * nblocks nanoseconds

The null kernel (i.e. the first statement is a conditional return) has
4 int arguments, 2 int arrays, 1 float, char*, short*, unsigned int*, int*, int*;
all but the last argument are const inputs and all pointers are restrict.
The registers etc are almost the same GTX 745 and K20
ptxas info : Used 25 registers, 192 bytes smem, 400 bytes cmem[0], 36 bytes cmem[2]
ptxas info : Used 25 registers, 192 bytes smem, 400 bytes cmem[0], 64 bytes cmem[2]

The Tesla K20 is in an older host with a slower CPU (2.2GHz) and older PCIe, but has the more
recent CUDA 9.1. I also had to disable GNU GCC link time optimisation with -fno-lto.

Sorry, I am out of ideas for time being. Your observations don’t make sense to me. In terms of practical solutions, you would want to attempt to pack more work into each kernel, so as to minimize the overhead of kernel launches.

To avoid confusion in terminology, please note that I defined null kernel as an empty kernel that takes no arguments. Obviously, pushing kernel arguments into the command queue of the GPU involves work, although the impact of increasing the number of arguments on launch overhead is minor in my recollection.

You seem to be mixing up threadblock scheduling latency (the title, and what you linked to in the other thread) with kernel launch latency (most of the discussion in this thread).

I would think a test case for measuring kernel launch latency along the lines you describe would only require a max of about 30 lines of code. Why not provide an exact test case?

Here is my test case:

$ cat t47.cu
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__global__ void tkernel(){
}

int main(){

  tkernel<<<2000, 32>>>();  // warm-up launch
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  unsigned long long dt1 = dt;
  tkernel<<<2000, 32>>>();
  dt = dtime_usec(dt);      // host returns as soon as the launch is queued
  cudaDeviceSynchronize();
  dt1 = dtime_usec(dt1);    // includes kernel execution
  printf("kernel launch: %fs, kernel duration: %fs\n", dt/(float)USECPSEC, dt1/(float)USECPSEC);
}
$ nvcc -arch=sm_35 -o t47 t47.cu
$ ./t47
kernel launch: 0.000009s, kernel duration: 0.000019s

so, 9us launch latency (host code duration), 19us total kernel duration (launch + execution)

CUDA 9.1, CentOS 7.4

Dear njuffa and txbob,
I was hoping to find some nasty interaction between
my code and CUDA, and so carefully chipped my stuff away until I was left with something
that looks very much like txbob’s code. Perhaps I should post mine anyway,
but I have also run t47 and it pretty much agrees with mine.

./t47 gives:
On my GTX 745 kernel launch: 0.000004s, kernel duration: 0.000026s (total 26 microseconds)
and the K20c kernel launch: 0.000008s, kernel duration: 0.000022s (total 22 microseconds)

Many thanks
ps: Removing the original 13 arguments made a little difference

ps: for completeness here is the output from nvprof on t47 on the two set ups

GTX 745 CUDA 7.0
/usr/local/cuda/bin/nvprof t47
==12757== NVPROF is profiling process 12757, command: t47
kernel launch: 9, kernel duration: 31 microseconds
==12757== Profiling application: t47
==12757== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00% 33.056us 2 16.528us 16.160us 16.896us tkernel(void)

==12757== API calls:
Time(%) Time Calls Avg Min Max Name
99.88% 602.03ms 2 301.01ms 6.5180us 602.02ms cudaLaunch
0.10% 573.18us 83 6.9050us 277ns 264.97us cuDeviceGetAttribute
0.01% 71.064us 1 71.064us 71.064us 71.064us cuDeviceTotalMem
0.01% 40.206us 2 20.103us 19.041us 21.165us cudaDeviceSynchronize
0.01% 37.252us 1 37.252us 37.252us 37.252us cuDeviceGetName
0.00% 8.1570us 2 4.0780us 1.0000us 7.1570us cudaConfigureCall
0.00% 2.3450us 2 1.1720us 337ns 2.0080us cuDeviceGetCount
0.00% 663ns 2 331ns 290ns 373ns cuDeviceGet

Tesla K20 CUDA 9.1
/usr/local/cuda/bin/nvprof ./t47
==13335== NVPROF is profiling process 13335, command: ./t47
kernel launch: 0.000027s, kernel duration: 0.000041s
==13335== Profiling application: ./t47
==13335== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 13.824us 2 6.9120us 6.7200us 7.1040us tkernel(void)
API calls: 99.64% 478.37ms 2 239.18ms 23.379us 478.35ms cudaLaunch
0.27% 1.3183ms 188 7.0120us 181ns 273.16us cuDeviceGetAttribute
0.05% 244.06us 2 122.03us 118.89us 125.17us cuDeviceTotalMem
0.02% 114.92us 2 57.460us 54.038us 60.882us cuDeviceGetName
0.01% 33.645us 2 16.822us 12.740us 20.905us cudaDeviceSynchronize
0.00% 17.028us 2 8.5140us 1.4670us 15.561us cudaConfigureCall
0.00% 2.4450us 3 815ns 310ns 1.3760us cuDeviceGetCount
0.00% 2.3780us 4 594ns 260ns 1.0880us cuDeviceGet

One suggestion I would have is that if you’re looking at the ~25us overhead as being a big part of your overall (actual kernel) execution time, then you may want to see if you can expose more work to be done in that CUDA kernel call to amortize the overhead.