cudaFree takes approx 99.5% of total time.

I am using the nvgraph SSSP (single-source shortest path) function in my code. However, almost all of the run time is spent in cudaFree.
The code is mainly based on the example provided on the CUDA Toolkit website.
The actual code is available here
Also, is there any way to control the kernel launches in the SSSP code from the example (for instance, the number of blocks and threads)? And is there a device-side implementation of SSSP?
The output of nvprof is given below.

Time(%)      Time  Calls       Avg       Min       Max  Name
 99.69%  4.38338s     14  313.10ms  3.8600us  3.27544s  cudaFree
  0.21%  9.4005ms   1060  8.8680us     101ns  1.1981ms  cuDeviceGetAttribute
  0.04%  1.6128ms     12  134.40us  90.140us  215.06us  cuDeviceTotalMem
  0.03%  1.3653ms     11  124.12us  4.2720us  742.99us  cudaMalloc
  0.02%  773.28us     12  64.439us  49.922us  81.480us  cuDeviceGetName
  0.00%  164.44us     26  6.3240us  4.0880us  22.405us  cudaLaunch
  0.00%  110.55us     10  11.055us  6.9960us  16.346us  cudaMemcpyAsync
  0.00%  56.303us      6  9.3830us  4.5840us  12.582us  cudaMemcpy
  0.00%  21.457us      9  2.3840us  1.6520us  4.3160us  cudaFuncGetAttributes
  0.00%  21.135us     51     414ns     240ns  1.4850us  cudaGetDevice
  0.00%  18.614us      5  3.7220us  2.8640us  6.2670us  cudaMemsetAsync
  0.00%  17.340us     52     333ns     221ns  1.0440us  cudaDeviceGetAttribute
  0.00%  15.224us     61     249ns     129ns  3.6550us  cudaSetupArgument
  0.00%  10.989us     17     646ns     410ns  1.8460us  cudaEventCreateWithFlags
  0.00%  9.2510us     52     177ns     123ns     604ns  cudaGetLastError
  0.00%  8.4280us      5  1.6850us  1.4040us  2.5740us  cudaStreamSynchronize
  0.00%  8.3230us     17     489ns     359ns  1.1080us  cudaEventDestroy
  0.00%  7.0270us      5  1.4050us  1.0530us  2.3390us  cudaEventQuery
  0.00%  6.1460us     20     307ns     180ns     747ns  cuDeviceGet
  0.00%  6.0010us     26     230ns     149ns     988ns  cudaConfigureCall
  0.00%  5.6200us      6     936ns     724ns  1.4690us  cudaEventRecord
  0.00%  5.5700us      5  1.1140us     298ns  1.6800us  cudaSetDevice
  0.00%  4.3360us      2  2.1680us  1.5950us  2.7410us  cudaThreadSynchronize
  0.00%  3.4680us      5     693ns     177ns  1.6300us  cuDeviceGetCount
  0.00%  1.7980us      2     899ns     647ns  1.1510us  cuInit
  0.00%     905ns      2     452ns     229ns     676ns  cuDriverGetVersion

cudaFree came up recently.

PS: a longer discussion of missing persistence mode appears in a tech report.

Your profile shows 14 calls to cudaFree. The longest of those accounts for 3.27s of the 4.38s total attributed to cudaFree, which is itself 99.69% of the profiled API time.

cudaFree is a synchronizing call. That means that it waits until all previous asynchronous work on the GPU is complete, before it actually performs the free operation.

Since your code does not appear to call cudaFree directly, and you are calling nvgraph, it's safe to assume that nvgraph is issuing a batch of asynchronous work and then ending it with one or more cudaFree calls. This is a reasonable design pattern for a library function.

It does not mean that the cudaFree operation itself took 3s. It means that the CPU was waiting for 3s for previous GPU work to finish, before it proceeded.
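
To see the synchronizing behaviour described above in isolation, here is a minimal sketch that times cudaFree from the host in two situations: once while a kernel is still running, and once after an explicit cudaDeviceSynchronize. The `spin` kernel, the `timedFree` helper and the cycle count are made up for illustration and have nothing to do with nvgraph's internals; error checking is omitted for brevity.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: busy-waits for roughly `cycles` GPU clock ticks.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Hypothetical helper: host wall-clock time of one cudaFree call, in ms.
static double timedFree(void *p) {
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(p);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // The first CUDA runtime call creates the context; do it up front
    // so that cost is not mixed into the measurements below.
    cudaFree(0);

    void *a = nullptr, *b = nullptr;
    cudaMalloc(&a, 1 << 20);
    cudaMalloc(&b, 1 << 20);

    // Arbitrary value, assumed to give a few hundred ms of GPU work.
    const long long cycles = 500000000LL;

    // Case 1: kernel launch returns immediately, cudaFree is issued
    // while the kernel is still running.
    spin<<<1, 1>>>(cycles);
    double msPending = timedFree(a);

    // Case 2: wait for the kernel explicitly, then free.
    spin<<<1, 1>>>(cycles);
    cudaDeviceSynchronize();
    double msIdle = timedFree(b);

    printf("cudaFree with GPU work pending: %.3f ms\n", msPending);
    printf("cudaFree after synchronize:     %.3f ms\n", msIdle);
    return 0;
}
```

If cudaFree behaves as described, the first figure should be close to the kernel's run time and the second should be small. In a profiler, the waiting time in the second case would show up under cudaDeviceSynchronize instead of cudaFree, which is one way to check where the time in your own run is really going.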