Thank you for your response.
- Unfortunately, even after adding --cuda-memory-usage=true, nsys still cannot collect page-fault information.
nsys profile --stats=true --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --show-output=true --cuda-memory-usage=true python test_um_pow.py
The output is as follows (sorry, it seems that I cannot attach the report file in this window):
Generating '/tmp/nsys-report-0e4d.qdstrm'
[ 1/11] [========================100%] report4.nsys-rep
[ 2/11] [========================100%] report4.sqlite
[ 3/11] Executing 'nvtx_sum' stats report
SKIPPED: /mypath/report4.sqlite does not contain NV Tools Extension (NVTX) data.
[ 4/11] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ----------- --------- ----------- ------------ ----------------------
99.9 130,242,996,683 7,406 17,586,145.9 8,104,574.0 2,530 100,236,708 29,916,823.1 poll
0.1 135,405,696 1,321 102,502.4 17,617.0 583 32,580,810 1,217,796.0 ioctl
0.0 20,158,404 50 403,168.1 5,523.5 1,167 19,701,557 2,784,969.6 fopen
0.0 10,120,289 2 5,060,144.5 5,060,144.5 5,043,264 5,077,025 23,872.6 nanosleep
0.0 7,345,055 1,815 4,046.9 1,258.0 463 226,117 10,697.3 read
0.0 5,495,187 23 238,921.2 209,435.0 75,966 645,548 143,593.4 pthread_create
0.0 3,758,690 901 4,171.7 3,924.0 1,861 21,435 1,524.4 open64
0.0 3,412,070 1,599 2,133.9 185.0 43 92,929 11,607.1 fgets
0.0 3,365,969 41 82,096.8 83,985.0 56,443 87,929 7,057.8 sleep
0.0 2,473,778 156 15,857.6 6,166.0 2,784 1,047,080 84,282.8 mmap64
0.0 1,446,804 10 144,680.4 81,484.0 59,584 739,766 209,296.7 sem_timedwait
0.0 759,349 26,695 28.4 21.0 18 6,453 72.1 pthread_cond_signal
0.0 537,717 26 20,681.4 8,349.5 2,228 283,668 54,208.7 mmap
0.0 217,353 13 16,719.5 9,660.0 1,990 118,902 31,214.9 open
0.0 146,602 46 3,187.0 2,580.0 831 12,513 2,525.6 fclose
0.0 82,619 13 6,355.3 6,346.0 2,521 13,869 2,960.3 munmap
0.0 48,756 13 3,750.5 3,534.0 861 6,906 1,744.2 write
0.0 39,050 71 550.0 439.0 343 1,809 315.5 fcntl
0.0 32,866 5 6,573.2 6,620.0 1,325 13,184 5,016.8 putc
0.0 25,066 68 368.6 329.0 23 1,781 232.8 sigaction
0.0 22,394 5 4,478.8 3,254.0 1,987 7,749 2,661.4 fopen64
0.0 21,458 15 1,430.5 156.0 69 17,100 4,348.1 fwrite
0.0 18,786 140 134.2 71.5 36 1,848 242.9 fflush
0.0 9,897 2 4,948.5 4,948.5 2,931 6,966 2,853.2 socket
0.0 9,751 1 9,751.0 9,751.0 9,751 9,751 0.0 connect
0.0 9,269 2 4,634.5 4,634.5 2,109 7,160 3,571.6 fread
0.0 7,244 1 7,244.0 7,244.0 7,244 7,244 0.0 pipe2
0.0 4,901 64 76.6 37.0 29 289 62.2 pthread_mutex_trylock
0.0 1,725 3 575.0 610.0 442 673 119.4 dup
0.0 1,521 1 1,521.0 1,521.0 1,521 1,521 0.0 bind
0.0 744 1 744.0 744.0 744 744 0.0 listen
0.0 415 11 37.7 25.0 25 156 39.3 flockfile
0.0 280 1 280.0 280.0 280 280 0.0 pthread_cond_broadcast
[ 5/11] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------------- ------------- -------- -------------- --------------- ----------------------------
97.7 77,073,752,774 55 1,401,340,959.5 13,003.0 12,279 59,489,263,573 8,322,385,353.3 cudaMemcpyAsync
2.2 1,767,642,204 2 883,821,102.0 883,821,102.0 5,839 1,767,636,365 1,249,903,531.6 cudaStreamIsCapturing_v10000
0.0 21,705,495 2 10,852,747.5 10,852,747.5 94,525 21,610,970 15,214,424.2 cudaMallocManaged
0.0 2,908,415 59 49,295.2 5,818.0 3,881 2,471,351 320,858.7 cudaLaunchKernel
0.0 754,773 55 13,723.1 2,392.0 2,199 559,820 75,043.9 cudaStreamSynchronize
0.0 98,876 2 49,438.0 49,438.0 48,821 50,055 872.6 cudaMemGetInfo
[ 6/11] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ------------- ------------ ----------------------------------------------------------------------------------------------------
100.0 20,713,721,552 32 647,303,798.5 649,122,493.0 553,444,804 1,029,526,123 85,052,390.6 void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
0.0 916,625 7 130,946.4 4,768.0 3,360 827,059 307,367.1 void at::native::<unnamed>::CatArrayBatchedCopy<float, unsigned int, (int)1, (int)128, (int)1>(T1 *…
0.0 9,567 3 3,189.0 2,720.0 2,688 4,159 840.2 void at::native::vectorized_elementwise_kernel<(int)4, void at::native::compare_scalar_kernel<doubl…
0.0 8,384 1 8,384.0 8,384.0 8,384 8,384 0.0 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<double, at::native::func_wrap…
0.0 8,224 1 8,224.0 8,224.0 8,224 8,224 0.0 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<double, at::native::func_wrap…
0.0 7,008 2 3,504.0 3,504.0 3,168 3,840 475.2 void at::native::vectorized_elementwise_kernel<(int)4, at::native::AbsFunctor<float>, at::detail::A…
0.0 6,976 2 3,488.0 3,488.0 3,360 3,616 181.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, bool…
0.0 6,816 2 3,408.0 3,408.0 2,688 4,128 1,018.2 void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<float, float, bool…
0.0 5,536 1 5,536.0 5,536.0 5,536 5,536 0.0 void at_cuda_detail::cub::DeviceReduceSingleTileKernel<at_cuda_detail::cub::DeviceReducePolicy<bool…
0.0 5,344 1 5,344.0 5,344.0 5,344 5,344 0.0 void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
0.0 4,608 1 4,608.0 4,608.0 4,608 4,608 0.0 void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIterator…
0.0 4,481 1 4,481.0 4,481.0 4,481 4,481 0.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<double, double, do…
0.0 4,416 1 4,416.0 4,416.0 4,416 4,416 0.0 void at_cuda_detail::cub::DeviceSelectSweepKernel<at_cuda_detail::cub::DispatchSelectIf<at_cuda_det…
0.0 4,159 1 4,159.0 4,159.0 4,159 4,159 0.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<bool, bool, bool, …
0.0 4,096 1 4,096.0 4,096.0 4,096 4,096 0.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<bool, bool, bool, …
0.0 3,584 1 3,584.0 3,584.0 3,584 3,584 0.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::ceil_kernel_cuda(at::TensorItera…
0.0 2,784 1 2,784.0 2,784.0 2,784 2,784 0.0 void at_cuda_detail::cub::DeviceCompactInitKernel<at_cuda_detail::cub::ScanTileState<int, (bool)1>,…
[ 7/11] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- --------- ---------------- ---------------- -------------- -------------- --------------- ------------------------------------
48.9 40,294,696,926 54 746,198,091.2 2,623.5 2,592 40,294,555,364 5,483,394,084.0 [CUDA memcpy Device-to-Host]
21.3 17,583,477,471 1 17,583,477,471.0 17,583,477,471.0 17,583,477,471 17,583,477,471 0.0 [CUDA memcpy Host-to-Device]
19.4 16,004,681,053 4,054,886 3,947.0 2,495.0 1,599 87,966 4,882.3 [CUDA memcpy Unified Host-to-Device]
10.4 8,560,807,079 52,812 162,099.7 160,446.0 40,575 497,176 18,805.7 [CUDA memcpy Unified Device-to-Host]
[ 8/11] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
----------- --------- ---------- ---------- ---------- ---------- ----------- ------------------------------------
110,751.646 52,812 2.097 2.097 0.524 2.097 0.010 [CUDA memcpy Unified Device-to-Host]
95,787.594 4,054,886 0.024 0.008 0.004 1.044 0.058 [CUDA memcpy Unified Host-to-Device]
48,000.000 54 888.889 0.000 0.000 48,000.000 6,531.973 [CUDA memcpy Device-to-Host]
48,000.000 1 48,000.000 48,000.000 48,000.000 48,000.000 0.000 [CUDA memcpy Host-to-Device]
[ 9/11] Executing 'um_sum' stats report
SKIPPED: /mypath/report4.sqlite does not contain CUDA Unified Memory CPU page faults data.
[10/11] Executing 'um_total_sum' stats report
SKIPPED: /mypath/report4.sqlite does not contain CUDA Unified Memory CPU page faults data.
[11/11] Executing 'um_cpu_page_faults_sum' stats report
SKIPPED: /mypath/report4.sqlite does not contain CUDA Unified Memory CPU page faults data.
Generated:
/mypath/report4.nsys-rep
/mypath/report4.sqlite
While the program was running, the output of ‘nvidia-smi’ was as follows:
Wed Jan 31 10:45:32 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |
| N/A 44C P0 92W / 300W | 32507MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:AF:00.0 Off | 0 |
| N/A 34C P0 50W / 300W | 2MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 6072 C ...pile_pt112cp39/bin/python 780MiB |
+-----------------------------------------------------------------------------+
-
nsys version: NVIDIA Nsight Systems version 2023.4.1.97-234133557503v0
-
As a control experiment, I profiled a small CUDA program to confirm that ‘nsys’ itself is functioning correctly:
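For reference, ‘test_um’ is roughly along these lines. This is a hypothetical reconstruction: only the kernel signature (matrix_mul_gpu) and the CUDA API calls (cudaMallocManaged, cudaMemPrefetchAsync, cudaDeviceSynchronize, cudaFree) come from the report below; the matrix dimensions and initialization are assumptions.

```cuda
// Sketch of test_um: managed-memory matrix multiply (names/sizes assumed).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void matrix_mul_gpu(int *a, int *b, int *c, int m, int n, int k) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < k) {
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += a[row * n + i] * b[i * k + col];
        c[row * k + col] = sum;
    }
}

int main() {
    const int m = 32, n = 32, k = 32;  // assumed dimensions
    int *a, *b, *c;

    // Managed allocations: both CPU and GPU touch these pages,
    // which is what generates the unified-memory page faults.
    cudaMallocManaged(&a, m * n * sizeof(int));
    cudaMallocManaged(&b, n * k * sizeof(int));
    cudaMallocManaged(&c, m * k * sizeof(int));

    for (int i = 0; i < m * n; ++i) a[i] = 1;  // CPU writes
    for (int i = 0; i < n * k; ++i) b[i] = 1;

    // Prefetch one buffer to the device (matches the single
    // cudaMemPrefetchAsync call in the report).
    cudaMemPrefetchAsync(a, m * n * sizeof(int), /*dstDevice=*/0);

    dim3 block(16, 16);
    dim3 grid((k + block.x - 1) / block.x, (m + block.y - 1) / block.y);
    matrix_mul_gpu<<<grid, block>>>(a, b, c, m, n, k);  // GPU touches b, c
    cudaDeviceSynchronize();

    printf("c[0] = %d\n", c[0]);  // CPU read of c -> DtoH migration

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```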
nsys profile --stats=true --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --show-output=true --cuda-memory-usage=true ./test_um
The output results are as follows:
Generating '/tmp/nsys-report-b5e4.qdstrm'
[ 1/11] [========================100%] report2.nsys-rep
[ 2/11] [========================100%] report2.sqlite
[ 3/11] Executing 'nvtx_sum' stats report
SKIPPED: /mypath/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[ 4/11] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ -------- ---------- ------------ ----------------------
67.8 193,575,574 17 11,386,798.5 10,092,264.0 9,143 53,938,064 14,734,003.6 poll
30.3 86,480,105 643 134,494.7 12,942.0 580 32,854,231 1,559,308.3 ioctl
0.5 1,412,275 30 47,075.8 5,429.5 3,156 1,027,686 185,666.0 mmap64
0.5 1,365,977 9 151,775.2 75,314.0 57,963 756,114 226,941.5 sem_timedwait
0.3 742,155 5 148,431.0 112,438.0 85,588 249,462 72,194.8 pthread_create
0.2 431,535 2 215,767.5 215,767.5 168,184 263,351 67,293.2 sem_wait
0.1 344,250 18 19,125.0 7,211.5 1,724 120,945 31,000.0 mmap
0.1 331,617 54 6,141.1 5,442.0 1,983 20,145 3,044.9 open64
0.1 322,208 37 8,708.3 5,061.0 1,141 136,371 21,698.1 fopen
0.0 79,101 7 11,300.1 3,448.0 2,621 59,681 21,342.0 munmap
0.0 77,166 27 2,858.0 59.0 43 75,636 14,544.8 fgets
0.0 66,345 31 2,140.2 2,255.0 835 3,707 714.1 fclose
0.0 45,552 12 3,796.0 3,857.5 800 6,894 1,790.1 write
0.0 38,236 6 6,372.7 6,021.0 2,245 10,455 3,325.4 open
0.0 32,621 62 526.1 427.5 333 3,607 426.7 fcntl
0.0 32,374 15 2,158.3 2,077.0 1,127 3,431 699.3 read
0.0 30,579 6 5,096.5 4,554.0 116 14,905 5,491.7 fread
0.0 22,053 5 4,410.6 1,744.0 1,125 11,884 4,646.2 putc
0.0 9,807 2 4,903.5 4,903.5 3,284 6,523 2,290.3 socket
0.0 8,433 1 8,433.0 8,433.0 8,433 8,433 0.0 connect
0.0 7,584 1 7,584.0 7,584.0 7,584 7,584 0.0 pipe2
0.0 6,808 10 680.8 169.0 99 4,885 1,483.6 fwrite
0.0 5,931 64 92.7 57.0 52 381 62.1 pthread_mutex_trylock
0.0 1,302 9 144.7 74.0 70 539 155.8 fflush
0.0 1,255 1 1,255.0 1,255.0 1,255 1,255 0.0 bind
0.0 623 1 623.0 623.0 623 623 0.0 listen
0.0 181 1 181.0 181.0 181 181 0.0 pthread_cond_broadcast
[ 5/11] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ --------- -------- ----------- ------------ ---------------------
98.9 153,687,380 3 51,229,126.7 15,429.0 10,096 153,661,855 88,709,345.0 cudaMallocManaged
0.5 853,419 1 853,419.0 853,419.0 853,419 853,419 0.0 cudaDeviceSynchronize
0.4 563,854 1 563,854.0 563,854.0 563,854 563,854 0.0 cudaMemPrefetchAsync
0.2 245,016 3 81,672.0 48,907.0 23,558 172,551 79,717.5 cudaFree
0.0 59,672 1 59,672.0 59,672.0 59,672 59,672 0.0 cudaMemcpyAsync
0.0 29,019 1 29,019.0 29,019.0 29,019 29,019 0.0 cudaLaunchKernel
[ 6/11] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- --------------------------------------------------
100.0 63,135 1 63,135.0 63,135.0 63,135 63,135 0.0 matrix_mul_gpu(int *, int *, int *, int, int, int)
[ 7/11] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- --------- --------- -------- -------- ----------- ------------------------------------
99.0 556,344 1 556,344.0 556,344.0 556,344 556,344 0.0 [CUDA memcpy Host-to-Device]
0.6 3,296 2 1,648.0 1,648.0 1,568 1,728 113.1 [CUDA memcpy Unified Device-to-Host]
0.4 2,336 1 2,336.0 2,336.0 2,336 2,336 0.0 [CUDA memcpy Unified Host-to-Device]
[ 8/11] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ------------------------------------
0.008 2 0.004 0.004 0.004 0.004 0.000 [CUDA memcpy Unified Device-to-Host]
0.004 1 0.004 0.004 0.004 0.004 0.000 [CUDA memcpy Unified Host-to-Device]
0.000 1 0.000 0.000 0.000 0.000 0.000 [CUDA memcpy Host-to-Device]
[ 9/11] Executing 'um_sum' stats report
Virtual Address HtoD Migration size (MB) DtoH Migration size (MB) CPU Page Faults GPU Page Faults Migration Throughput (MB/s)
--------------- ------------------------ ------------------------ --------------- ---------------- ---------------------------
0x7FE864000000 0.004 0.008 2 2 2,181.808
[10/11] Executing 'um_total_sum' stats report
Total HtoD Migration Size (MB) Total DtoH Migration Size (MB) Total CPU Page Faults Total GPU PageFaults Minimum Virtual Address Maximum Virtual Address
------------------------------ ------------------------------ --------------------- -------------------- ----------------------- -----------------------
0.004 0.008 2 2 0x7FE864000000 0x7FE864000000
[11/11] Executing 'um_cpu_page_faults_sum' stats report
CPU Page Faults CPU Instruction Address
--------------- -----------------------
2 main
Generated:
/mypath/report2.nsys-rep
/mypath/report2.sqlite
This test confirms that ‘nsys’ can capture page faults for a plain CUDA program. Why can’t it capture page faults in a Python program that uses PyTorch?