Not getting CUDA Memory Operation Statistics

I’m taking the Accelerated Computing course and working through the page-faults module, but I don’t get the CUDA Memory Operation Statistics section in my nsys output:

Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
WARNING: The command line includes a target application therefore the CPU context-switch scope has been set to process-tree.
Collecting data...
Processing events...
Saving temporary "/tmp/nsys-report-41a4-313b-b3f8-5862.qdstrm" file to disk...

Creating final output files...
Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-41a4-313b-b3f8-5862.qdrep"
Exporting 1060 events: [==================================================100%]

Exported successfully to
/tmp/nsys-report-41a4-313b-b3f8-5862.sqlite


CUDA API Statistics:

 Time(%)  Total Time (ns)  Num Calls    Average     Minimum    Maximum           Name         
 -------  ---------------  ---------  -----------  ---------  ---------  ---------------------
    91.2        261802252          1  261802252.0  261802252  261802252  cudaMallocManaged    
     6.7         19130645          1   19130645.0   19130645   19130645  cudaDeviceSynchronize
     2.1          6008423          1    6008423.0    6008423    6008423  cudaFree             
     0.0            37687          1      37687.0      37687      37687  cudaLaunchKernel     



CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances   Average    Minimum   Maximum            Name          
 -------  ---------------  ---------  ----------  --------  --------  -----------------------
   100.0         19121415          1  19121415.0  19121415  19121415  deviceKernel(int*, int)



Operating System Runtime API Statistics:

 Time(%)  Total Time (ns)  Num Calls   Average    Minimum   Maximum              Name           
 -------  ---------------  ---------  ----------  -------  ---------  --------------------------
    69.0        360451138         20  18022556.9    48579  100127962  poll                      
    21.3        111434090        666    167318.5     1012   18594843  ioctl                     
     7.5         39163519         16   2447719.9    13751   20907532  sem_timedwait             
     1.7          8672010         92     94261.0     1252    5869931  mmap                      
     0.4          2021984         82     24658.3     4659      49522  open64                    
     0.0           186682          4     46670.5    31469      64385  pthread_create            
     0.0           167157          3     55719.0    53419      60267  fgets                     
     0.0           140769         25      5630.8     1511      24129  fopen                     
     0.0           106346         11      9667.8     4090      14039  write                     
     0.0            40206         27      1489.1     1058       6546  fcntl                     
     0.0            34644          7      4949.1     2445       8483  munmap                    
     0.0            34090          5      6818.0     3832       9746  open                      
     0.0            28303         18      1572.4     1020       5215  fclose                    
     0.0            27899          5      5579.8     1092       7321  pthread_rwlock_timedwrlock
     0.0            25177          2     12588.5     8203      16974  socket                    
     0.0            24503         12      2041.9     1092       4180  read                      
     0.0            22936          5      4587.2     1171      10734  fgetc                     
     0.0            14289          1     14289.0    14289      14289  pipe2                     
     0.0             9440          4      2360.0     1885       2856  mprotect                  
     0.0             9010          2      4505.0     3830       5180  fread                     
     0.0             8827          1      8827.0     8827       8827  connect                   
     0.0             2786          1      2786.0     2786       2786  bind                      
     0.0             1925          1      1925.0     1925       1925  listen                    

Report file moved to "/dli/task/report5.qdrep"
Report file moved to "/dli/task/report5.sqlite"
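For reference, the code I’m profiling in this run is roughly the following shape. This is a sketch: only `deviceKernel(int*, int)` and the runtime calls (`cudaMallocManaged`, `cudaDeviceSynchronize`, `cudaFree`, one kernel launch) appear in the report above; the kernel body, `N`, and the launch configuration are assumptions.

```cuda
#include <cuda_runtime.h>

// Assumed kernel body: only the GPU ever touches the managed
// allocation, so its pages are first touched (and stay resident)
// on the device.
__global__ void deviceKernel(int *a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < N; i += stride)
        a[i] = 1;   // first touch happens on the device
}

int main()
{
    const int N = 1 << 26;           // size assumed for illustration
    int *a;
    cudaMallocManaged(&a, N * sizeof(int));

    deviceKernel<<<256, 256>>>(a, N);
    cudaDeviceSynchronize();

    // The host never reads or writes a[], so no Unified Memory
    // migrations occur and nsys has no memory operations to report.
    cudaFree(a);
    return 0;
}
```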

Earlier in the same notebook, for the vector-add exercise, I was getting the CUDA memory stats, per below:

CUDA API Statistics:

 Time(%)  Total Time (ns)  Num Calls    Average      Minimum     Maximum            Name         
 -------  ---------------  ---------  ------------  ----------  ----------  ---------------------
    88.8       2315390328          1  2315390328.0  2315390328  2315390328  cudaDeviceSynchronize
    10.4        271472573          3    90490857.7       19411   271376306  cudaMallocManaged    
     0.8         21322764          3     7107588.0     6343104     8428518  cudaFree             
     0.0            46645          1       46645.0       46645       46645  cudaLaunchKernel     



CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances    Average      Minimum     Maximum                       Name                    
 -------  ---------------  ---------  ------------  ----------  ----------  -------------------------------------------
   100.0       2315434815          1  2315434815.0  2315434815  2315434815  addVectorsInto(float*, float*, float*, int)



CUDA Memory Operation Statistics (by time):

 Time(%)  Total Time (ns)  Operations  Average  Minimum  Maximum              Operation            
 -------  ---------------  ----------  -------  -------  -------  ---------------------------------
    76.5         68296926        2304  29642.8     1886   177502  [CUDA Unified Memory memcpy HtoD]
    23.5         20983319         768  27322.0     1119   165278  [CUDA Unified Memory memcpy DtoH]



CUDA Memory Operation Statistics (by size in KiB):

   Total     Operations  Average  Minimum  Maximum               Operation            
 ----------  ----------  -------  -------  --------  ---------------------------------
 393216.000        2304  170.667    4.000  1020.000  [CUDA Unified Memory memcpy HtoD]
 131072.000         768  170.667    4.000  1020.000  [CUDA Unified Memory memcpy DtoH]


The notebook is found in the paid course: Jupyter Notebook

What’s strange is that in the run above I only call the device kernel function, so the CUDA memory stats don’t show. But when I also call the host function, everything shows up, per below:

CUDA API Statistics:

 Time(%)  Total Time (ns)  Num Calls    Average     Minimum    Maximum           Name         
 -------  ---------------  ---------  -----------  ---------  ---------  ---------------------
    94.3        436247029          1  436247029.0  436247029  436247029  cudaMallocManaged    
     3.9         17858782          1   17858782.0   17858782   17858782  cudaDeviceSynchronize
     1.8          8542833          1    8542833.0    8542833    8542833  cudaFree             
     0.0            63322          1      63322.0      63322      63322  cudaLaunchKernel     



CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances   Average    Minimum   Maximum            Name          
 -------  ---------------  ---------  ----------  --------  --------  -----------------------
   100.0         17856214          1  17856214.0  17856214  17856214  deviceKernel(int*, int)



CUDA Memory Operation Statistics (by time):

 Time(%)  Total Time (ns)  Operations  Average  Minimum  Maximum              Operation            
 -------  ---------------  ----------  -------  -------  -------  ---------------------------------
   100.0         21303922         768  27739.5     1599   173214  [CUDA Unified Memory memcpy DtoH]



CUDA Memory Operation Statistics (by size in KiB):

   Total     Operations  Average  Minimum  Maximum               Operation            
 ----------  ----------  -------  -------  --------  ---------------------------------
 131072.000         768  170.667    4.000  1020.000  [CUDA Unified Memory memcpy DtoH]

I get it now: the profiler is recording the Unified Memory page migrations from host to device and back. When only the device kernel touches the managed memory, the pages never migrate, so there are no memory operations to report; once the host accesses the memory as well, the device-to-host migrations appear in the statistics.
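To illustrate, here is a sketch of the variant that produces the DtoH entries in the last report. The host-side function name and body are hypothetical; the point is only that a CPU access to the managed array after the kernel forces pages back to the host, which nsys reports as `[CUDA Unified Memory memcpy DtoH]`.

```cuda
#include <cuda_runtime.h>

__global__ void deviceKernel(int *a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < N; i += stride)
        a[i] = 1;
}

// Hypothetical host function: reading/writing the managed array on
// the CPU after the kernel triggers device-to-host page migrations.
void hostFunction(int *a, int N)
{
    for (int i = 0; i < N; ++i)
        a[i] += 1;
}

int main()
{
    const int N = 1 << 26;           // size assumed for illustration
    int *a;
    cudaMallocManaged(&a, N * sizeof(int));

    deviceKernel<<<256, 256>>>(a, N);
    cudaDeviceSynchronize();         // must finish before the host touches a[]

    hostFunction(a, N);              // causes the DtoH migrations in the report

    cudaFree(a);
    return 0;
}
```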