Thank you for your response.
- Unfortunately, even after adding --cuda-memory-usage=true, nsys still cannot collect page-fault information.
nsys profile --stats=true --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --show-output=true --cuda-memory-usage=true python test_um_pow.py
The output is as follows (sorry, it seems that I cannot attach the report file in this window):
Generating '/tmp/nsys-report-0e4d.qdstrm'
[ 1/11] [========================100%] report4.nsys-rep
[ 2/11] [========================100%] report4.sqlite
[ 3/11] Executing 'nvtx_sum' stats report
SKIPPED: /mypath/report4.sqlite does not contain NV Tools Extension (NVTX) data.
[ 4/11] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ----------- --------- ----------- ------------ ----------------------
99.9 130,242,996,683 7,406 17,586,145.9 8,104,574.0 2,530 100,236,708 29,916,823.1 poll
0.1 135,405,696 1,321 102,502.4 17,617.0 583 32,580,810 1,217,796.0 ioctl
0.0 20,158,404 50 403,168.1 5,523.5 1,167 19,701,557 2,784,969.6 fopen
0.0 10,120,289 2 5,060,144.5 5,060,144.5 5,043,264 5,077,025 23,872.6 nanosleep
0.0 7,345,055 1,815 4,046.9 1,258.0 463 226,117 10,697.3 read
0.0 5,495,187 23 238,921.2 209,435.0 75,966 645,548 143,593.4 pthread_create
0.0 3,758,690 901 4,171.7 3,924.0 1,861 21,435 1,524.4 open64
0.0 3,412,070 1,599 2,133.9 185.0 43 92,929 11,607.1 fgets
0.0 3,365,969 41 82,096.8 83,985.0 56,443 87,929 7,057.8 sleep
0.0 2,473,778 156 15,857.6 6,166.0 2,784 1,047,080 84,282.8 mmap64
0.0 1,446,804 10 144,680.4 81,484.0 59,584 739,766 209,296.7 sem_timedwait
0.0 759,349 26,695 28.4 21.0 18 6,453 72.1 pthread_cond_signal
0.0 537,717 26 20,681.4 8,349.5 2,228 283,668 54,208.7 mmap
0.0 217,353 13 16,719.5 9,660.0 1,990 118,902 31,214.9 open
0.0 146,602 46 3,187.0 2,580.0 831 12,513 2,525.6 fclose
0.0 82,619 13 6,355.3 6,346.0 2,521 13,869 2,960.3 munmap
0.0 48,756 13 3,750.5 3,534.0 861 6,906 1,744.2 write
0.0 39,050 71 550.0 439.0 343 1,809 315.5 fcntl
0.0 32,866 5 6,573.2 6,620.0 1,325 13,184 5,016.8 putc
0.0 25,066 68 368.6 329.0 23 1,781 232.8 sigaction
0.0 22,394 5 4,478.8 3,254.0 1,987 7,749 2,661.4 fopen64
0.0 21,458 15 1,430.5 156.0 69 17,100 4,348.1 fwrite
0.0 18,786 140 134.2 71.5 36 1,848 242.9 fflush
0.0 9,897 2 4,948.5 4,948.5 2,931 6,966 2,853.2 socket
0.0 9,751 1 9,751.0 9,751.0 9,751 9,751 0.0 connect
0.0 9,269 2 4,634.5 4,634.5 2,109 7,160 3,571.6 fread
0.0 7,244 1 7,244.0 7,244.0 7,244 7,244 0.0 pipe2
0.0 4,901 64 76.6 37.0 29 289 62.2 pthread_mutex_trylock
0.0 1,725 3 575.0 610.0 442 673 119.4 dup
0.0 1,521 1 1,521.0 1,521.0 1,521 1,521 0.0 bind
0.0 744 1 744.0 744.0 744 744 0.0 listen
0.0 415 11 37.7 25.0 25 156 39.3 flockfile
0.0 280 1 280.0 280.0 280 280 0.0 pthread_cond_broadcast
[ 5/11] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------------- ------------- -------- -------------- --------------- ----------------------------
97.7 77,073,752,774 55 1,401,340,959.5 13,003.0 12,279 59,489,263,573 8,322,385,353.3 cudaMemcpyAsync
2.2 1,767,642,204 2 883,821,102.0 883,821,102.0 5,839 1,767,636,365 1,249,903,531.6 cudaStreamIsCapturing_v10000
0.0 21,705,495 2 10,852,747.5 10,852,747.5 94,525 21,610,970 15,214,424.2 cudaMallocManaged
0.0 2,908,415 59 49,295.2 5,818.0 3,881 2,471,351 320,858.7 cudaLaunchKernel
0.0 754,773 55 13,723.1 2,392.0 2,199 559,820 75,043.9 cudaStreamSynchronize
0.0 98,876 2 49,438.0 49,438.0 48,821 50,055 872.6 cudaMemGetInfo
[ 6/11] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ------------- ------------ ----------------------------------------------------------------------------------------------------
100.0 20,713,721,552 32 647,303,798.5 649,122,493.0 553,444,804 1,029,526,123 85,052,390.6 void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
0.0 916,625 7 130,946.4 4,768.0 3,360 827,059 307,367.1 void at::native::<unnamed>::CatArrayBatchedCopy<float, unsigned int, (int)1, (int)128, (int)1>(T1 *…
0.0 9,567 3 3,189.0 2,720.0 2,688 4,159 840.2 void at::native::vectorized_elementwise_kernel<(int)4, void at::native::compare_scalar_kernel<doubl…
0.0 8,384 1 8,384.0 8,384.0 8,384 8,384 0.0 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<double, at::native::func_wrap…
0.0 8,224 1 8,224.0 8,224.0 8,224 8,224 0.0 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<double, at::native::func_wrap…
0.0 7,008 2 3,504.0 3,504.0 3,168 3,840 475.2 void at::native::vectorized_elementwise_kernel<(int)4, at::native::AbsFunctor<float>, at::detail::A…
0.0 6,976 2 3,488.0 3,488.0 3,360 3,616 181.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, bool…
0.0 6,816 2 3,408.0 3,408.0 2,688 4,128 1,018.2 void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<float, float, bool…
0.0 5,536 1 5,536.0 5,536.0 5,536 5,536 0.0 void at_cuda_detail::cub::DeviceReduceSingleTileKernel<at_cuda_detail::cub::DeviceReducePolicy<bool…
0.0 5,344 1 5,344.0 5,344.0 5,344 5,344 0.0 void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
0.0 4,608 1 4,608.0 4,608.0 4,608 4,608 0.0 void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIterator…
0.0 4,481 1 4,481.0 4,481.0 4,481 4,481 0.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<double, double, do…
0.0 4,416 1 4,416.0 4,416.0 4,416 4,416 0.0 void at_cuda_detail::cub::DeviceSelectSweepKernel<at_cuda_detail::cub::DispatchSelectIf<at_cuda_det…
0.0 4,159 1 4,159.0 4,159.0 4,159 4,159 0.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<bool, bool, bool, …
0.0 4,096 1 4,096.0 4,096.0 4,096 4,096 0.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<bool, bool, bool, …
0.0 3,584 1 3,584.0 3,584.0 3,584 3,584 0.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::ceil_kernel_cuda(at::TensorItera…
0.0 2,784 1 2,784.0 2,784.0 2,784 2,784 0.0 void at_cuda_detail::cub::DeviceCompactInitKernel<at_cuda_detail::cub::ScanTileState<int, (bool)1>,…
[ 7/11] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- --------- ---------------- ---------------- -------------- -------------- --------------- ------------------------------------
48.9 40,294,696,926 54 746,198,091.2 2,623.5 2,592 40,294,555,364 5,483,394,084.0 [CUDA memcpy Device-to-Host]
21.3 17,583,477,471 1 17,583,477,471.0 17,583,477,471.0 17,583,477,471 17,583,477,471 0.0 [CUDA memcpy Host-to-Device]
19.4 16,004,681,053 4,054,886 3,947.0 2,495.0 1,599 87,966 4,882.3 [CUDA memcpy Unified Host-to-Device]
10.4 8,560,807,079 52,812 162,099.7 160,446.0 40,575 497,176 18,805.7 [CUDA memcpy Unified Device-to-Host]
[ 8/11] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
----------- --------- ---------- ---------- ---------- ---------- ----------- ------------------------------------
110,751.646 52,812 2.097 2.097 0.524 2.097 0.010 [CUDA memcpy Unified Device-to-Host]
95,787.594 4,054,886 0.024 0.008 0.004 1.044 0.058 [CUDA memcpy Unified Host-to-Device]
48,000.000 54 888.889 0.000 0.000 48,000.000 6,531.973 [CUDA memcpy Device-to-Host]
48,000.000 1 48,000.000 48,000.000 48,000.000 48,000.000 0.000 [CUDA memcpy Host-to-Device]
[ 9/11] Executing 'um_sum' stats report
SKIPPED: /mypath/report4.sqlite does not contain CUDA Unified Memory CPU page faults data.
[10/11] Executing 'um_total_sum' stats report
SKIPPED: /mypath/report4.sqlite does not contain CUDA Unified Memory CPU page faults data.
[11/11] Executing 'um_cpu_page_faults_sum' stats report
SKIPPED: /mypath/report4.sqlite does not contain CUDA Unified Memory CPU page faults data.
Generated:
/mypath/report4.nsys-rep
/mypath/report4.sqlite
While the program was running, the output of ‘nvidia-smi’ was as follows:
Wed Jan 31 10:45:32 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |
| N/A 44C P0 92W / 300W | 32507MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:AF:00.0 Off | 0 |
| N/A 34C P0 50W / 300W | 2MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 6072 C ...pile_pt112cp39/bin/python 780MiB |
+-----------------------------------------------------------------------------+
-
nsys version: NVIDIA Nsight Systems version 2023.4.1.97-234133557503v0
-
As a control experiment, I profiled a small CUDA program to confirm that ‘nsys’ itself is functioning correctly:
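For reference, ‘test_um’ is roughly along these lines. This is a hypothetical reconstruction: only the kernel signature (matrix_mul_gpu) and the CUDA API calls (cudaMallocManaged, cudaMemPrefetchAsync, cudaDeviceSynchronize, cudaFree) come from the report below; the matrix dimensions and initialization are assumptions.

```cuda
// Sketch of test_um: managed-memory matrix multiply (names/sizes assumed).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void matrix_mul_gpu(int *a, int *b, int *c, int m, int n, int k) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < k) {
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += a[row * n + i] * b[i * k + col];
        c[row * k + col] = sum;
    }
}

int main() {
    const int m = 32, n = 32, k = 32;  // assumed dimensions
    int *a, *b, *c;

    // Managed allocations: both CPU and GPU touch these pages,
    // which is what generates the unified-memory page faults.
    cudaMallocManaged(&a, m * n * sizeof(int));
    cudaMallocManaged(&b, n * k * sizeof(int));
    cudaMallocManaged(&c, m * k * sizeof(int));

    for (int i = 0; i < m * n; ++i) a[i] = 1;  // CPU writes
    for (int i = 0; i < n * k; ++i) b[i] = 1;

    // Prefetch one buffer to the device (matches the single
    // cudaMemPrefetchAsync call in the report).
    cudaMemPrefetchAsync(a, m * n * sizeof(int), /*dstDevice=*/0);

    dim3 block(16, 16);
    dim3 grid((k + block.x - 1) / block.x, (m + block.y - 1) / block.y);
    matrix_mul_gpu<<<grid, block>>>(a, b, c, m, n, k);  // GPU touches b, c
    cudaDeviceSynchronize();

    printf("c[0] = %d\n", c[0]);  // CPU read of c -> DtoH migration

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```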
nsys profile --stats=true --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --show-output=true --cuda-memory-usage=true ./test_um
The output results are as follows:
Generating '/tmp/nsys-report-b5e4.qdstrm'
[ 1/11] [========================100%] report2.nsys-rep
[ 2/11] [========================100%] report2.sqlite
[ 3/11] Executing 'nvtx_sum' stats report
SKIPPED: /mypath/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[ 4/11] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ -------- ---------- ------------ ----------------------
67.8 193,575,574 17 11,386,798.5 10,092,264.0 9,143 53,938,064 14,734,003.6 poll
30.3 86,480,105 643 134,494.7 12,942.0 580 32,854,231 1,559,308.3 ioctl
0.5 1,412,275 30 47,075.8 5,429.5 3,156 1,027,686 185,666.0 mmap64
0.5 1,365,977 9 151,775.2 75,314.0 57,963 756,114 226,941.5 sem_timedwait
0.3 742,155 5 148,431.0 112,438.0 85,588 249,462 72,194.8 pthread_create
0.2 431,535 2 215,767.5 215,767.5 168,184 263,351 67,293.2 sem_wait
0.1 344,250 18 19,125.0 7,211.5 1,724 120,945 31,000.0 mmap
0.1 331,617 54 6,141.1 5,442.0 1,983 20,145 3,044.9 open64
0.1 322,208 37 8,708.3 5,061.0 1,141 136,371 21,698.1 fopen
0.0 79,101 7 11,300.1 3,448.0 2,621 59,681 21,342.0 munmap
0.0 77,166 27 2,858.0 59.0 43 75,636 14,544.8 fgets
0.0 66,345 31 2,140.2 2,255.0 835 3,707 714.1 fclose
0.0 45,552 12 3,796.0 3,857.5 800 6,894 1,790.1 write
0.0 38,236 6 6,372.7 6,021.0 2,245 10,455 3,325.4 open
0.0 32,621 62 526.1 427.5 333 3,607 426.7 fcntl
0.0 32,374 15 2,158.3 2,077.0 1,127 3,431 699.3 read
0.0 30,579 6 5,096.5 4,554.0 116 14,905 5,491.7 fread
0.0 22,053 5 4,410.6 1,744.0 1,125 11,884 4,646.2 putc
0.0 9,807 2 4,903.5 4,903.5 3,284 6,523 2,290.3 socket
0.0 8,433 1 8,433.0 8,433.0 8,433 8,433 0.0 connect
0.0 7,584 1 7,584.0 7,584.0 7,584 7,584 0.0 pipe2
0.0 6,808 10 680.8 169.0 99 4,885 1,483.6 fwrite
0.0 5,931 64 92.7 57.0 52 381 62.1 pthread_mutex_trylock
0.0 1,302 9 144.7 74.0 70 539 155.8 fflush
0.0 1,255 1 1,255.0 1,255.0 1,255 1,255 0.0 bind
0.0 623 1 623.0 623.0 623 623 0.0 listen
0.0 181 1 181.0 181.0 181 181 0.0 pthread_cond_broadcast
[ 5/11] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ --------- -------- ----------- ------------ ---------------------
98.9 153,687,380 3 51,229,126.7 15,429.0 10,096 153,661,855 88,709,345.0 cudaMallocManaged
0.5 853,419 1 853,419.0 853,419.0 853,419 853,419 0.0 cudaDeviceSynchronize
0.4 563,854 1 563,854.0 563,854.0 563,854 563,854 0.0 cudaMemPrefetchAsync
0.2 245,016 3 81,672.0 48,907.0 23,558 172,551 79,717.5 cudaFree
0.0 59,672 1 59,672.0 59,672.0 59,672 59,672 0.0 cudaMemcpyAsync
0.0 29,019 1 29,019.0 29,019.0 29,019 29,019 0.0 cudaLaunchKernel
[ 6/11] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- --------------------------------------------------
100.0 63,135 1 63,135.0 63,135.0 63,135 63,135 0.0 matrix_mul_gpu(int *, int *, int *, int, int, int)
[ 7/11] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- --------- --------- -------- -------- ----------- ------------------------------------
99.0 556,344 1 556,344.0 556,344.0 556,344 556,344 0.0 [CUDA memcpy Host-to-Device]
0.6 3,296 2 1,648.0 1,648.0 1,568 1,728 113.1 [CUDA memcpy Unified Device-to-Host]
0.4 2,336 1 2,336.0 2,336.0 2,336 2,336 0.0 [CUDA memcpy Unified Host-to-Device]
[ 8/11] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ------------------------------------
0.008 2 0.004 0.004 0.004 0.004 0.000 [CUDA memcpy Unified Device-to-Host]
0.004 1 0.004 0.004 0.004 0.004 0.000 [CUDA memcpy Unified Host-to-Device]
0.000 1 0.000 0.000 0.000 0.000 0.000 [CUDA memcpy Host-to-Device]
[ 9/11] Executing 'um_sum' stats report
Virtual Address HtoD Migration size (MB) DtoH Migration size (MB) CPU Page Faults GPU Page Faults Migration Throughput (MB/s)
--------------- ------------------------ ------------------------ --------------- ---------------- ---------------------------
0x7FE864000000 0.004 0.008 2 2 2,181.808
[10/11] Executing 'um_total_sum' stats report
Total HtoD Migration Size (MB) Total DtoH Migration Size (MB) Total CPU Page Faults Total GPU PageFaults Minimum Virtual Address Maximum Virtual Address
------------------------------ ------------------------------ --------------------- -------------------- ----------------------- -----------------------
0.004 0.008 2 2 0x7FE864000000 0x7FE864000000
[11/11] Executing 'um_cpu_page_faults_sum' stats report
CPU Page Faults CPU Instruction Address
--------------- -----------------------
2 main
Generated:
/mypath/report2.nsys-rep
/mypath/report2.sqlite
This test confirms that ‘nsys’ can capture page faults for a plain CUDA program. Why can’t it capture page faults in a Python program that uses PyTorch?