Hi all,
I am not sure if I am reading this nvprof output correctly (see below). My application took 36.47 second to run. According to the nvprof output, the four kernels took about 5 seconds to run while the API cudaLaunch and cudaSetupArgument took about 30 seconds.
Am I reading this output correctly? Is my application being slowdown by excessive kernel launches?
Thanks.
efblack@blackPanther ~/Development/svdParallel/cuda/src1 $ nvprof svdCuda …/…/matrices/500by500.dat
==10851== NVPROF is profiling process 10851, command: svdCuda …/…/matrices/500by500.dat
of Blocks:4, # of threadsPerBlock: 128
Number of iterations needed: 10
It tooks 3.646571e+01 seconds to finish
==10851== Profiling application: svdCuda …/…/matrices/500by500.dat
==10851== Warning: Found 4926780 invalid records in the result.
==10851== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==10851== Profiling result:
Time(%) Time Calls Avg Min Max Name
38.67% 1.65324s 262144 6.3060us 5.8240us 7.9360us posM(double*, double*, double*, int, int, int, int, double*)
22.38% 956.87ms 262144 3.6500us 3.5200us 5.4720us diagonal(double*, double*, int, int, int)
19.69% 841.73ms 262145 3.2100us 3.0720us 5.1520us symmetric(double*, double*, int, int, int)
19.14% 818.14ms 524287 1.5600us 1.4080us 2.3360us preM(double*, int, int, int, double*, bool)
0.10% 4.3772ms 12 364.76us 343.30us 375.84us [CUDA memcpy DtoH]
0.02% 1.0178ms 3 339.28us 334.21us 349.03us [CUDA memcpy HtoD]
==10851== API calls:
Time(%) Time Calls Avg Min Max Name
74.37% 23.1090s 6237500 3.7040us 3.1690us 7.1323ms cudaLaunch
21.96% 6.82379s 37425000 182ns 121ns 1.21889s cudaSetupArgument
3.49% 1.08476s 6237500 173ns 126ns 2.4857ms cudaConfigureCall
0.15% 46.731ms 2 23.365ms 1.0400us 46.730ms cudaEventCreate
0.02% 7.2962ms 15 486.41us 317.43us 527.16us cudaMemcpy
0.00% 873.25us 83 10.521us 94ns 393.16us cuDeviceGetAttribute
0.00% 696.51us 4 174.13us 142.20us 239.39us cudaFree
0.00% 574.93us 4 143.73us 139.08us 152.71us cudaMalloc
0.00% 102.49us 1 102.49us 102.49us 102.49us cuDeviceGetName
0.00% 90.284us 1 90.284us 90.284us 90.284us cuDeviceTotalMem
0.00% 88.365us 10 8.8360us 8.3380us 9.1320us cudaEventSynchronize
0.00% 56.113us 20 2.8050us 2.1920us 3.3740us cudaEventRecord
0.00% 14.792us 10 1.4790us 1.3170us 1.6440us cudaEventElapsedTime
0.00% 3.9460us 2 1.9730us 661ns 3.2850us cudaEventDestroy
0.00% 929ns 2 464ns 191ns 738ns cuDeviceGetCount
0.00% 370ns 2 185ns 168ns 202ns cuDeviceGet