Are excessive kernel launches killing my application?

Hi all,

I am not sure if I am reading this nvprof output correctly (see below). My application took 36.47 seconds to run. According to the nvprof output, the four kernels took about 5 seconds to run, while the API calls cudaLaunch and cudaSetupArgument took about 30 seconds.

Am I reading this output correctly? Is my application being slowed down by excessive kernel launches?

Thanks.

efblack@blackPanther ~/Development/svdParallel/cuda/src1 $ nvprof svdCuda …/…/matrices/500by500.dat
==10851== NVPROF is profiling process 10851, command: svdCuda …/…/matrices/500by500.dat

# of Blocks: 4, # of threadsPerBlock: 128

Number of iterations needed: 10

It tooks 3.646571e+01 seconds to finish

==10851== Profiling application: svdCuda …/…/matrices/500by500.dat
==10851== Warning: Found 4926780 invalid records in the result.
==10851== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==10851== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 38.67%  1.65324s    262144  6.3060us  5.8240us  7.9360us  posM(double*, double*, double*, int, int, int, int, double*)
 22.38%  956.87ms    262144  3.6500us  3.5200us  5.4720us  diagonal(double*, double*, int, int, int)
 19.69%  841.73ms    262145  3.2100us  3.0720us  5.1520us  symmetric(double*, double*, int, int, int)
 19.14%  818.14ms    524287  1.5600us  1.4080us  2.3360us  preM(double*, int, int, int, double*, bool)
  0.10%  4.3772ms        12  364.76us  343.30us  375.84us  [CUDA memcpy DtoH]
  0.02%  1.0178ms         3  339.28us  334.21us  349.03us  [CUDA memcpy HtoD]

==10851== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 74.37%  23.1090s   6237500  3.7040us  3.1690us  7.1323ms  cudaLaunch
 21.96%  6.82379s  37425000     182ns     121ns  1.21889s  cudaSetupArgument
  3.49%  1.08476s   6237500     173ns     126ns  2.4857ms  cudaConfigureCall
  0.15%  46.731ms         2  23.365ms  1.0400us  46.730ms  cudaEventCreate
  0.02%  7.2962ms        15  486.41us  317.43us  527.16us  cudaMemcpy
  0.00%  873.25us        83  10.521us      94ns  393.16us  cuDeviceGetAttribute
  0.00%  696.51us         4  174.13us  142.20us  239.39us  cudaFree
  0.00%  574.93us         4  143.73us  139.08us  152.71us  cudaMalloc
  0.00%  102.49us         1  102.49us  102.49us  102.49us  cuDeviceGetName
  0.00%  90.284us         1  90.284us  90.284us  90.284us  cuDeviceTotalMem
  0.00%  88.365us        10  8.8360us  8.3380us  9.1320us  cudaEventSynchronize
  0.00%  56.113us        20  2.8050us  2.1920us  3.3740us  cudaEventRecord
  0.00%  14.792us        10  1.4790us  1.3170us  1.6440us  cudaEventElapsedTime
  0.00%  3.9460us         2  1.9730us     661ns  3.2850us  cudaEventDestroy
  0.00%     929ns         2     464ns     191ns     738ns  cuDeviceGetCount
  0.00%     370ns         2     185ns     168ns     202ns  cuDeviceGet

I wouldn’t feel comfortable discussing nvprof output until you get rid of those warning messages. Something doesn’t look right: you have approximately 6.2 million calls to cudaLaunch, but only about 1.3 million kernel executions actually show up in the profiling result; the ~4.9 million invalid records the profiler warns about account for the difference. There’s really no reason you should have to profile 6 million kernel launches to get a representative view of what your application is doing. Trim the application down (or at least the region that gets profiled) before profiling.
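
If it helps, one way to trim what gets profiled without restructuring the whole application is to bracket a small, representative slice of the run with the CUDA profiler API and tell nvprof not to profile from the start. A minimal sketch only; the loop structure and iteration numbers are placeholders, not your code:

```cpp
// Sketch: profile only a small slice of the run.
// Run with:  nvprof --profile-from-start off ./svdCuda <matrix file>
#include <cuda_profiler_api.h>

void runIterations(int nIters)
{
    for (int i = 0; i < nIters; ++i) {
        if (i == 5) cudaProfilerStart();   // begin capturing here
        // ... launch posM / diagonal / symmetric / preM as usual ...
        if (i == 8) cudaProfilerStop();    // stop after a small slice so nvprof
                                           // records far fewer launches
    }
}
```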

Launching kernels with only 4 blocks of 128 threads per block (512 threads in total) is not a good way to get performance out of any GPU; a configuration that small leaves most of the machine idle.
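
For comparison, here is an illustrative grid-stride pattern (not your kernels) where a single launch uses a block count sized to the problem instead of a fixed 4 blocks of 128 threads:

```cpp
// Illustrative grid-stride kernel: one launch covers all n elements,
// with enough blocks to keep the GPU occupied.
__global__ void scaleVec(double *x, double a, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= a;
}

void launchScale(double *d_x, double a, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n
    scaleVec<<<blocks, threads>>>(d_x, a, n);
}
```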

It certainly does look like you have a very large number of very tiny kernel launches.

The minimal round-trip time for a kernel launch is typically about 5 microseconds, that is, 5e-6 seconds. That number has not changed in many years, so it is presumably due to some fundamental overhead, e.g. the latency inherent in the PCIe interface between the host and the GPU.

That means, as a rule of thumb, that one can launch on the order of 200K kernels per second (1 / 5e-6). Your roughly six million kernel launches would then take about 30 seconds, which matches the observed behavior fairly well.
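
If you want to sanity-check that rule of thumb on your own machine, a rough micro-benchmark is to time a long run of back-to-back empty kernel launches. This measures sustained launch throughput rather than the strict round-trip latency of a single launch, but it gives the same order of magnitude:

```cpp
// Rough launch-overhead micro-benchmark: time N back-to-back empty launches.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int N = 100000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();            // warm-up (context creation, module load)
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < N; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average time per launch: %.2f us\n", 1000.0f * ms / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```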

When faced with a scenario of many tiny kernel launches, you would want to explore the possibility of batching the work, e.g. instead of each kernel launch handling one small matrix, let each kernel handle an entire batch of small matrices. Note that CUBLAS already supports some batched operations.
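
To make the batching idea concrete, here is a sketch of the structure; the kernel name and the per-matrix work are made up for illustration. One block processes one matrix, so a single launch covers the whole batch:

```cpp
// Hypothetical batched kernel: one block per small matrix, one launch per batch,
// instead of one kernel launch per matrix.
__global__ void processBatched(double *matrices, int rows, int cols,
                               int batchSize, double a)
{
    int m = blockIdx.x;                          // which matrix this block handles
    if (m >= batchSize) return;
    double *A = matrices + (size_t)m * rows * cols;
    for (int i = threadIdx.x; i < rows * cols; i += blockDim.x)
        A[i] *= a;                               // stand-in for the real per-matrix work
}

void launchBatch(double *d_matrices, int rows, int cols, int batchSize, double a)
{
    // One launch replaces batchSize separate launches.
    processBatched<<<batchSize, 128>>>(d_matrices, rows, cols, batchSize, a);
}
```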

Thanks to both for your answers.

I am looking into these issues now.

Thanks again.