Nsys cudaapisum report: API stats reported in two entries

I am using the nsys CLI to optimize my program. When I run nsys profile with the cudabacktrace flag set to all, the cudaapisum report shows two entries for most CUDA APIs.
This is the report I got:

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)     Max (ns)     StdDev (ns)            Name
 --------  ---------------  ---------  ------------  ------------  ----------  -------------  ------------  ----------------------
     28.8   91,957,708,291  1,364,620      67,387.0       9,528.0         841     55,129,870     428,869.7  cudaDeviceSynchronize
     24.3   77,731,576,254    104,039     747,138.8     240,304.0      80,004     55,130,050   1,382,346.9  cudaDeviceSynchronize
     14.9   47,664,131,331  1,048,236      45,470.8       2,815.0       1,784    345,937,486   3,433,152.9  cudaFree
     13.5   43,180,784,242      6,450   6,694,695.2     239,022.0      80,004    345,937,686  43,258,632.9  cudaFree
      5.3   16,990,948,581  1,034,957      16,417.1       2,474.0       1,874    347,093,481   1,394,598.7  cudaMalloc
      4.1   13,132,337,874      6,145   2,137,077.0     287,956.0      80,005    347,093,701  17,974,529.8  cudaMalloc
      3.0    9,470,943,993  2,382,553       3,975.1       3,636.0       3,126        348,705       1,705.9  cudaLaunchKernel
      2.3    7,422,490,843  1,023,756       7,250.3       9,429.0       1,332        119,661       4,164.1  cudaStreamSynchronize
      1.5    4,749,824,328    342,495      13,868.3      13,316.0      11,262        351,229       2,247.6  cudaMemcpyAsync
      1.0    3,233,425,735    711,103       4,547.1       4,168.0       3,216        537,299       1,977.7  cudaMemset
      0.7    2,190,179,086     19,105     114,639.1      57,241.0      32,754  1,200,552,105   8,685,390.9  cudaMemcpy
      0.4    1,242,838,837        488   2,546,800.9      83,787.0      80,005  1,200,553,028  54,342,552.5  cudaMemcpy
      0.1      350,302,733     13,279      26,380.2      21,722.0      18,065     20,754,944     180,145.8  cudaMallocManaged
      0.0       98,646,469          1  98,646,469.0  98,646,469.0  98,646,469     98,646,469           0.0  cudaDeviceReset
      0.0       25,395,048         47     540,320.2      89,193.0      80,195     20,755,196   3,012,879.7  cudaMallocManaged
      0.0       21,389,943        152     140,723.3     109,467.0      99,322        348,965      74,613.1  cudaLaunchKernel
      0.0        8,096,566         72     112,452.3     112,137.0     104,271        119,872       3,469.7  cudaStreamSynchronize
      0.0        5,910,385         46     128,486.6     119,195.5     112,919        351,520      46,724.3  cudaMemcpyAsync
      0.0        5,758,285         39     147,648.3     109,231.0     103,771        537,460      96,980.2  cudaMemset
      0.0            2,595          1       2,595.0       2,595.0       2,595          2,595           0.0  cuModuleGetLoadingMode
      0.0            1,554          1       1,554.0       1,554.0       1,554          1,554           0.0  cuCtxSynchronize
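
To make the pairing easier to see, the rows can be grouped by API name. Below is a minimal Python sketch, assuming the report text has been pasted into a string (only a few rows are copied here). The helper names group_rows and duplicated are made up for illustration and are not part of nsys.

```python
from collections import defaultdict

# A few rows copied from the report above. Columns:
# Time (%), Total Time (ns), Num Calls, Avg, Med, Min, Max, StdDev, Name
REPORT = """\
     28.8   91,957,708,291  1,364,620      67,387.0       9,528.0         841     55,129,870     428,869.7  cudaDeviceSynchronize
     24.3   77,731,576,254    104,039     747,138.8     240,304.0      80,004     55,130,050   1,382,346.9  cudaDeviceSynchronize
     14.9   47,664,131,331  1,048,236      45,470.8       2,815.0       1,784    345,937,486   3,433,152.9  cudaFree
     13.5   43,180,784,242      6,450   6,694,695.2     239,022.0      80,004    345,937,686  43,258,632.9  cudaFree
      0.0       98,646,469          1  98,646,469.0  98,646,469.0  98,646,469     98,646,469           0.0  cudaDeviceReset
"""


def group_rows(text):
    """Map each API name to a list of (num_calls, min_ns), one tuple per report row."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9:
            continue  # skip blank lines and the column header
        try:
            num_calls = int(fields[2].replace(",", ""))
            min_ns = int(fields[5].replace(",", ""))
        except ValueError:
            continue  # skip the dashed separator line
        groups[fields[8]].append((num_calls, min_ns))
    return dict(groups)


def duplicated(groups):
    """Keep only the APIs that appear in more than one row."""
    return {name: rows for name, rows in groups.items() if len(rows) > 1}


if __name__ == "__main__":
    for name, rows in duplicated(group_rows(REPORT)).items():
        for num_calls, min_ns in rows:
            print(f"{name:24s} calls={num_calls:>10,d}  min_ns={min_ns:>8,d}")
```

Printing Num Calls and Min side by side for each pair makes it easier to compare the two rows of a given API directly.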

I want to know why there are two entries for the same API.
Is that because of the cudabacktrace flag? Without this flag, I get only one entry per API.
Is this done to give an idea of the profiler overhead?
If so, which entry refers to the profiler overhead and which one to the application itself?

@Andrey_Trachenko can you get an answer for this? Thanks.