I am using the nsys CLI to optimize my program. When I run nsys profile with the cudabacktrace flag set to all, I get two entries for most CUDA APIs in the cudaapisum report.
This is the report I got:
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ------------- ------------ ----------------------
28.8 91,957,708,291 1,364,620 67,387.0 9,528.0 841 55,129,870 428,869.7 cudaDeviceSynchronize
24.3 77,731,576,254 104,039 747,138.8 240,304.0 80,004 55,130,050 1,382,346.9 cudaDeviceSynchronize
14.9 47,664,131,331 1,048,236 45,470.8 2,815.0 1,784 345,937,486 3,433,152.9 cudaFree
13.5 43,180,784,242 6,450 6,694,695.2 239,022.0 80,004 345,937,686 43,258,632.9 cudaFree
5.3 16,990,948,581 1,034,957 16,417.1 2,474.0 1,874 347,093,481 1,394,598.7 cudaMalloc
4.1 13,132,337,874 6,145 2,137,077.0 287,956.0 80,005 347,093,701 17,974,529.8 cudaMalloc
3.0 9,470,943,993 2,382,553 3,975.1 3,636.0 3,126 348,705 1,705.9 cudaLaunchKernel
2.3 7,422,490,843 1,023,756 7,250.3 9,429.0 1,332 119,661 4,164.1 cudaStreamSynchronize
1.5 4,749,824,328 342,495 13,868.3 13,316.0 11,262 351,229 2,247.6 cudaMemcpyAsync
1.0 3,233,425,735 711,103 4,547.1 4,168.0 3,216 537,299 1,977.7 cudaMemset
0.7 2,190,179,086 19,105 114,639.1 57,241.0 32,754 1,200,552,105 8,685,390.9 cudaMemcpy
0.4 1,242,838,837 488 2,546,800.9 83,787.0 80,005 1,200,553,028 54,342,552.5 cudaMemcpy
0.1 350,302,733 13,279 26,380.2 21,722.0 18,065 20,754,944 180,145.8 cudaMallocManaged
0.0 98,646,469 1 98,646,469.0 98,646,469.0 98,646,469 98,646,469 0.0 cudaDeviceReset
0.0 25,395,048 47 540,320.2 89,193.0 80,195 20,755,196 3,012,879.7 cudaMallocManaged
0.0 21,389,943 152 140,723.3 109,467.0 99,322 348,965 74,613.1 cudaLaunchKernel
0.0 8,096,566 72 112,452.3 112,137.0 104,271 119,872 3,469.7 cudaStreamSynchronize
0.0 5,910,385 46 128,486.6 119,195.5 112,919 351,520 46,724.3 cudaMemcpyAsync
0.0 5,758,285 39 147,648.3 109,231.0 103,771 537,460 96,980.2 cudaMemset
0.0 2,595 1 2,595.0 2,595.0 2,595 2,595 0.0 cuModuleGetLoadingMode
0.0 1,554 1 1,554.0 1,554.0 1,554 1,554 0.0 cuCtxSynchronize
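To make the duplication easier to see, here is a small sketch that groups a few rows by API name. The call counts and minimum times are copied by hand from the report above; the names `ROWS`, `group_by_name`, and `duplicated` are made up for this illustration and are not part of nsys.

```python
from collections import defaultdict

# A few rows copied by hand from the report above:
# (API name, Num Calls, Min (ns)). Illustrative subset, not the full report.
ROWS = [
    ("cudaDeviceSynchronize", 1_364_620, 841),
    ("cudaDeviceSynchronize", 104_039, 80_004),
    ("cudaFree", 1_048_236, 1_784),
    ("cudaFree", 6_450, 80_004),
    ("cudaMalloc", 1_034_957, 1_874),
    ("cudaMalloc", 6_145, 80_005),
    ("cudaDeviceReset", 1, 98_646_469),
]


def group_by_name(rows):
    """Collect the (num_calls, min_ns) pairs reported for each API name."""
    groups = defaultdict(list)
    for name, num_calls, min_ns in rows:
        groups[name].append((num_calls, min_ns))
    return dict(groups)


def duplicated(groups):
    """Return the API names that appear more than once, sorted by name."""
    return sorted(name for name, entries in groups.items() if len(entries) > 1)


if __name__ == "__main__":
    groups = group_by_name(ROWS)
    for name in duplicated(groups):
        for num_calls, min_ns in groups[name]:
            print(f"{name:24s} calls={num_calls:>9,} min={min_ns:>7,} ns")
```

Grouped this way, each duplicated API shows one entry with many calls and a second entry with far fewer calls.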
I want to know why there are two entries for the same API.
Is that because of the cudabacktrace flag? Without this flag, I get only one entry per API.
Is this done to give an idea of the profiler overhead?
If yes, which entry refers to the profiler overhead and which one to the application itself?