Are excessive kernel launches killing my application?

Hi all,

I am not sure if I am reading this nvprof output correctly (see below). My application took 36.47 seconds to run. According to the nvprof output, the four kernels took about 5 seconds to run, while the API calls cudaLaunch and cudaSetupArgument took about 30 seconds.

Am I reading this output correctly? Is my application being slowed down by excessive kernel launches?

Thanks.

efblack@blackPanther ~/Development/svdParallel/cuda/src1 $ nvprof svdCuda …/…/matrices/500by500.dat
==10851== NVPROF is profiling process 10851, command: svdCuda …/…/matrices/500by500.dat

# of Blocks: 4, # of threadsPerBlock: 128

Number of iterations needed: 10

It tooks 3.646571e+01 seconds to finish

==10851== Profiling application: svdCuda …/…/matrices/500by500.dat
==10851== Warning: Found 4926780 invalid records in the result.
==10851== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==10851== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 38.67%  1.65324s    262144  6.3060us  5.8240us  7.9360us  posM(double*, double*, double*, int, int, int, int, double*)
 22.38%  956.87ms    262144  3.6500us  3.5200us  5.4720us  diagonal(double*, double*, int, int, int)
 19.69%  841.73ms    262145  3.2100us  3.0720us  5.1520us  symmetric(double*, double*, int, int, int)
 19.14%  818.14ms    524287  1.5600us  1.4080us  2.3360us  preM(double*, int, int, int, double*, bool)
  0.10%  4.3772ms        12  364.76us  343.30us  375.84us  [CUDA memcpy DtoH]
  0.02%  1.0178ms         3  339.28us  334.21us  349.03us  [CUDA memcpy HtoD]

==10851== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 74.37%  23.1090s   6237500  3.7040us  3.1690us  7.1323ms  cudaLaunch
 21.96%  6.82379s  37425000     182ns     121ns  1.21889s  cudaSetupArgument
  3.49%  1.08476s   6237500     173ns     126ns  2.4857ms  cudaConfigureCall
  0.15%  46.731ms         2  23.365ms  1.0400us  46.730ms  cudaEventCreate
  0.02%  7.2962ms        15  486.41us  317.43us  527.16us  cudaMemcpy
  0.00%  873.25us        83  10.521us      94ns  393.16us  cuDeviceGetAttribute
  0.00%  696.51us         4  174.13us  142.20us  239.39us  cudaFree
  0.00%  574.93us         4  143.73us  139.08us  152.71us  cudaMalloc
  0.00%  102.49us         1  102.49us  102.49us  102.49us  cuDeviceGetName
  0.00%  90.284us         1  90.284us  90.284us  90.284us  cuDeviceTotalMem
  0.00%  88.365us        10  8.8360us  8.3380us  9.1320us  cudaEventSynchronize
  0.00%  56.113us        20  2.8050us  2.1920us  3.3740us  cudaEventRecord
  0.00%  14.792us        10  1.4790us  1.3170us  1.6440us  cudaEventElapsedTime
  0.00%  3.9460us         2  1.9730us     661ns  3.2850us  cudaEventDestroy
  0.00%     929ns         2     464ns     191ns     738ns  cuDeviceGetCount
  0.00%     370ns         2     185ns     168ns     202ns  cuDeviceGet

I wouldn’t feel comfortable discussing nvprof output until you get rid of those warning messages. Something doesn’t look right: you have approximately 6.2 million calls to cudaLaunch, but only about 1.3 million kernel executions actually show up in the profiling result; the ~4.9 million invalid records the profiler warns about account for the difference. There’s really no reason you should have to profile 6 million kernel launches to get a representative view of what your application is doing. Trim the application down (or at least the region that gets profiled) before profiling.
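
If it helps, one way to trim what gets profiled without restructuring the whole application is to bracket a small, representative slice of the run with the CUDA profiler API and tell nvprof not to profile from the start. A minimal sketch only; the loop structure and iteration numbers are placeholders, not your code:

```cpp
// Sketch: profile only a small slice of the run.
// Run with:  nvprof --profile-from-start off ./svdCuda <matrix file>
#include <cuda_profiler_api.h>

void runIterations(int nIters)
{
    for (int i = 0; i < nIters; ++i) {
        if (i == 5) cudaProfilerStart();   // begin capturing here
        // ... launch posM / diagonal / symmetric / preM as usual ...
        if (i == 8) cudaProfilerStop();    // stop after a small slice so nvprof
                                           // records far fewer launches
    }
}
```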

Launching kernels with only 4 blocks of 128 threads per block (512 threads in total) is not a good way to get performance out of any GPU; a configuration that small leaves most of the machine idle.
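
For comparison, here is an illustrative grid-stride pattern (not your kernels) where a single launch uses a block count sized to the problem instead of a fixed 4 blocks of 128 threads:

```cpp
// Illustrative grid-stride kernel: one launch covers all n elements,
// with enough blocks to keep the GPU occupied.
__global__ void scaleVec(double *x, double a, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= a;
}

void launchScale(double *d_x, double a, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n
    scaleVec<<<blocks, threads>>>(d_x, a, n);
}
```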

It certainly does look like you have a very large number of very tiny kernel launches.

The minimal round-trip time for a kernel launch is typically about 5 microseconds, that is, 5e-6 seconds. That number has not changed in many years, so it is presumably due to some fundamental overhead, e.g. the latency inherent in the PCIe interface between the host and the GPU.

That means, as a rule of thumb, that one can launch on the order of 200K kernels per second (1 / 5e-6). Your roughly six million kernel launches would then take about 30 seconds, which matches the observed behavior fairly well.
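
If you want to sanity-check that rule of thumb on your own machine, a rough micro-benchmark is to time a long run of back-to-back empty kernel launches. This measures sustained launch throughput rather than the strict round-trip latency of a single launch, but it gives the same order of magnitude:

```cpp
// Rough launch-overhead micro-benchmark: time N back-to-back empty launches.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int N = 100000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();            // warm-up (context creation, module load)
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < N; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average time per launch: %.2f us\n", 1000.0f * ms / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```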

When faced with a scenario of many tiny kernel launches, you would want to explore the possibility of batching the work, e.g. instead of each kernel launch handling one small matrix, let each kernel handle an entire batch of small matrices. Note that CUBLAS already supports some batched operations.
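
To make the batching idea concrete, here is a sketch of the structure; the kernel name and the per-matrix work are made up for illustration. One block processes one matrix, so a single launch covers the whole batch:

```cpp
// Hypothetical batched kernel: one block per small matrix, one launch per batch,
// instead of one kernel launch per matrix.
__global__ void processBatched(double *matrices, int rows, int cols,
                               int batchSize, double a)
{
    int m = blockIdx.x;                          // which matrix this block handles
    if (m >= batchSize) return;
    double *A = matrices + (size_t)m * rows * cols;
    for (int i = threadIdx.x; i < rows * cols; i += blockDim.x)
        A[i] *= a;                               // stand-in for the real per-matrix work
}

void launchBatch(double *d_matrices, int rows, int cols, int batchSize, double a)
{
    // One launch replaces batchSize separate launches.
    processBatched<<<batchSize, 128>>>(d_matrices, rows, cols, batchSize, a);
}
```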

Thanks to both for your answers.

I am looking into these issues now.

Thanks again.