Hi all,
I am using Nsight Systems on a DGX A100 to collect unified memory page fault information. Here is the command I tried:
nsys profile --stats=true --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --show-output=true ./myapplication
However, the reported summary only shows the cudaMallocManaged call. It does not include any detailed statistics about unified memory page faults on either the CPU or the GPU. Here is the summary I got by running the above command:
CUDA API Statistics:
Time(%) Total Time (ns) Num Calls Average Minimum Maximum StdDev Name
------- --------------- --------- ----------- --------- --------- -------- --------------------
99.6 307198070 1 307198070.0 307198070 307198070 0.0 cudaMallocManaged
0.2 656747 3 218915.7 3490 648467 372002.9 cudaMalloc
0.2 504097 1 504097.0 504097 504097 0.0 cudaEventSynchronize
0.0 88401 3 29467.0 5551 76280 40544.4 cudaFree
0.0 49250 2 24625.0 20810 28440 5395.2 cudaMemcpy
0.0 38349 1 38349.0 38349 38349 0.0 cudaLaunchKernel
0.0 11751 2 5875.5 3871 7880 2834.8 cudaEventRecord
0.0 4180 2 2090.0 660 3520 2022.3 cudaEventCreate
CUDA Kernel Statistics:
Time(%) Total Time (ns) Instances Average Minimum Maximum StdDev Name
------- --------------- --------- -------- ------- ------- ------ ------------------------------------------------------------------------
100.0 498458 1 498458.0 498458 498458 0.0 mykernel(float*, float const*, int const*, int const*, int, int, int)
CUDA Memory Operation Statistics (by time):
Time(%) Total Time (ns) Operations Average Minimum Maximum StdDev Operation
------- --------------- ---------- ------- ------- ------- ------ ------------------
100.0 10816 2 5408.0 4288 6528 1583.9 [CUDA memcpy HtoD]
CUDA Memory Operation Statistics (by size in KiB):
Total Operations Average Minimum Maximum StdDev Operation
------ ---------- ------- ------- ------- ------ ------------------
52.992 2 26.496 10.578 42.414 22.511 [CUDA memcpy HtoD]
Operating System Runtime API Statistics:
Time(%) Total Time (ns) Num Calls Average Minimum Maximum StdDev Name
------- --------------- --------- ---------- ------- -------- ---------- --------------
48.1 331545542 18 18419196.8 80550 83054204 23063304.4 poll
42.2 290751254 1710 170030.0 1020 17036417 630506.2 ioctl
5.3 36739837 15 2449322.5 20810 20961217 5796984.8 sem_timedwait
2.9 19888846 142 140062.3 2050 18737729 1571764.3 open64
0.9 5934531 78 76083.7 1311 5708922 646078.8 fopen
0.4 3059706 95 32207.4 1120 1082445 110447.7 mmap
0.0 198128 4 49532.0 44259 54410 4445.5 pthread_create
0.0 176158 4 44039.5 39069 58120 9393.0 fgets
0.0 117225 12 9768.8 1940 62870 16921.7 write
0.0 81487 29 2809.9 1140 5170 777.7 read
0.0 67478 24 2811.6 1090 19799 4060.4 fgetc
0.0 35770 6 5961.7 2630 13100 4412.6 open
0.0 27239 8 3404.9 2040 7120 1661.2 munmap
0.0 25119 16 1569.9 1090 2300 346.0 fclose
0.0 15110 8 1888.8 1060 2660 608.1 fcntl
0.0 12309 1 12309.0 12309 12309 0.0 pipe2
0.0 10129 2 5064.5 4920 5209 204.4 socket
0.0 6240 1 6240.0 6240 6240 0.0 fopen64
0.0 6050 1 6050.0 6050 6050 0.0 connect
0.0 4270 1 4270.0 4270 4270 0.0 fflush
0.0 3650 1 3650.0 3650 3650 0.0 fwrite
0.0 1960 1 1960.0 1960 1960 0.0 bind
0.0 1240 1 1240.0 1240 1240 0.0 listen
Did I miss some key steps?
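One way to narrow this down is to profile a tiny managed-memory test with the same nsys command. The sketch below is hypothetical and is not the application profiled above: the kernel name touch and the buffer size are assumptions made for illustration. It writes a managed buffer on the CPU, updates it on the GPU, and reads it back on the CPU, so both GPU and CPU page faults would be expected if fault collection is working on this system.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: increments every element, so the GPU has to
// access each page of the managed buffer.
__global__ void touch(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 24;  // 16M floats = 64 MiB (arbitrary size)
    float *data = nullptr;

    cudaError_t err = cudaMallocManaged(&data, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // First touch on the CPU: pages become resident in host memory.
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // GPU access to host-resident pages: GPU page faults and
    // host-to-device migrations are expected here.
    touch<<<(n + 255) / 256, 256>>>(data, n);
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(data);
        return 1;
    }

    // CPU access after the kernel: pages migrate back, so CPU page
    // faults are expected here.
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += data[i];
    printf("sum = %.0f\n", sum);

    cudaFree(data);
    return 0;
}
```

If this test reports page faults but the real application does not, the application may simply never touch its managed allocation on the GPU. If neither reports them, the issue is more likely in the profiler setup or version.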
Thanks!