Nsys command line on agx pegasus

I’m trying to use the nsys command line to get the memory usage of an application on a DRIVE AGX Pegasus (following this forum post).

Running the latest ARM build (2021.4.1) on the AGX Pegasus:
nsys profile --stats=true -t cuda -b fp --cudabacktrace=all --cuda-memory-usage=true /usr/local/driveworks/bin/sample_hello_world
gives the error
WARNING: CPU sampling requires root privileges, disabling
WARNING: CUDA backtraces will not be collected because CPU sampling is disabled
and produces no stats.
On the other hand, running the exact same command on my development laptop gives stats and no warnings.
I have tried running with sudo, but this leads to a segmentation fault on the target.

Addendum
Just using
nsys profile --stats=true /usr/local/driveworks/bin/sample_hello_world
gives the same error message on the target, but it works without error messages on the development laptop.

moved this topic to DRIVE AGX General - NVIDIA Developer Forums.

Hi, @iejs

Please refer to Nsight nsys cannot collect cuda information. Thanks.

Thanks for your reply VickNV.

I note that the solution is ambiguous as to what actually worked. The poster did two things:
i) reinstalled DRIVE OS 5.2.6
ii) manually installed the Nsight command line for ARM, version 2021.2.1 (as they did not find nsys on the target).

It is unclear which of these actually solved the problem, or if it was the combination.

I am restricted to using drive os 5.1.6 and drive software 10.0.

I tried using version 2021.2.1 of the Nsight client for ARM, but with no success in profiling CUDA (and similar error messages).

Can you confirm that correct version of nsight should be installed with drive os 5.1.6 and drive software 10.0?

Dear @iejs,
Can you confirm that correct version of nsight should be installed with drive os 5.1.6 and drive software 10.0?

It is 2019.3

Thanks for your reply.

This appears to be the version which runs on the host development machine which I have.

I mean the command line version which runs on the target.

(And I note that no ARM version of 2019.3 is available from the Nsight downloads page.)

For DRIVE Software 10, please refer to Nsys cannot collect cuda information on Drive OS 5.1.
You had better start with the version installed by SDK Manager along with the release.

@VickNV this actually is a Nsight Systems question if you want to move it back to that topic.

@iejs can you run “nsys status -e” on the AGX and let me know what it says? This will help us figure out if you really have a permissions issue.

@VickNV this post is very long, and you did not link to a particular part of it. Moreover, it is not at all clear that that post answers my question, which is: which command line version of Nsight is installed on the target for DRIVE OS 5.1.6?

So far I have had to manually install an ARM command line version on the target - I wanted to know what version is meant to be installed.

@hwilper

Here is the terminal output of a session, where I do the following:

  1. I get version of nsys
  2. I set perf_event_paranoid to 1
  3. I run nsys status -e
  4. I try to profile sample_hello_world
  5. It then complains about privileges.

nvidia@tegra-ubuntu:~$ nsys --version
NVIDIA Nsight Systems version 2021.1.1.66-6c5c5cb

nvidia@tegra-ubuntu:~$ sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'

nvidia@tegra-ubuntu:~$ nsys status -e
Timestamp counter supported: Yes
Sampling Environment Check
Linux Kernel Paranoid Level = 1: OK
Linux Distribution = Ubuntu
Linux Kernel Version = 4.14.102-rt53-tegra: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
Root privileges: No
Kernel module: Available
Sampling Environment: OK

nvidia@tegra-ubuntu:~$ nsys profile --stats=true /usr/local/driveworks/bin/sample_hello_world 
WARNING: CPU sampling requires root privileges, disabling.
WARNING: Backtraces will not be collected because sampling is disabled.
WARNING: 'timer' backtrace collection trigger will not be used because sampling is disabled.
WARNING: 'sched' backtrace collection trigger will not be used because sampling is disabled.
Collecting data...
*************************************************
Welcome to Driveworks SDK
[27-10-2021 09:45:32] Platform: Detected DDPX - Tegra A
[27-10-2021 09:45:32] TimeSource: monotonic epoch time offset is 1635287828840403
[27-10-2021 09:45:32] PTP Time is available from NVPPS Driver
[27-10-2021 09:45:35] Platform: number of GPU devices detected 2
[27-10-2021 09:45:35] Platform: currently selected GPU device discrete ID 0
[27-10-2021 09:45:35] SDK: Resources mounted from /usr/local/driveworks-2.2/data/
[27-10-2021 09:45:35] SDK: Create NvMediaDevice
[27-10-2021 09:45:35] egl::Display: found 2 EGL devices
[27-10-2021 09:45:35] egl::Display: use drm device: drm-nvdc
[27-10-2021 09:45:36] TimeSource: monotonic epoch time offset is 1635287828840403
[27-10-2021 09:45:36] PTP Time is available from NVPPS Driver
[27-10-2021 09:45:36] Initialize DriveWorks SDK v2.2.3136
[27-10-2021 09:45:36] Release build with GNU 7.3.1 from heads/buildbrain-branch-0-gca7b4b26e65 against Drive PDK v5.1.6.1
Context of Driveworks SDK successfully initialized.
Version: 2.2.3136
GPU devices detected: 2
[27-10-2021 09:45:36] Platform: currently selected GPU device discrete ID 0
----------------------------------------------
Device: 0, Graphics Device
CUDA Driver Version / Runtime Version : 10.2 / 10.2
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory in MBytes:7679.94
Memory Clock rate Khz: 1440000
Memory Bus Width bits: 256
L2 Cache Size: 4194304
Maximum 1D Texture Dimension Size (x): 131072
Maximum 2D Texture Dimension Size (x,y): 131072, 65536
Maximum 3D Texture Dimension Size (x,y,z): 16384, 16384, 16384
Maximum Layered 1D Texture Size, (x): 32768 num: 2048
Maximum Layered 2D Texture Size, (x,y): 32768, 32768 num: 2048
Total amount of constant memory bytes: 65536
Total amount of shared memory per block bytes: 49152
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): 1024,1024,64
Max dimension size of a grid size (x,y,z): 2147483647,65535,65535
Maximum memory pitch bytes: 2147483647
Texture alignment bytes: 512
Concurrent copy and kernel execution: Yes, copy engines num: 3
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID: 1, Device PCI Bus ID: 1, Device PCI location ID: 0
Compute Mode: Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
Concurrent kernels: 1
Concurrent memory: 0

[27-10-2021 09:45:36] Platform: currently selected GPU device integrated ID 1
----------------------------------------------
Device: 1, Xavier
CUDA Driver Version / Runtime Version : 10.2 / 10.2
CUDA Capability Major/Minor version number: 7.2
Total amount of global memory in MBytes:27924.1
Memory Clock rate Khz: 1109000
Memory Bus Width bits: 256
L2 Cache Size: 524288
Maximum 1D Texture Dimension Size (x): 131072
Maximum 2D Texture Dimension Size (x,y): 131072, 65536
Maximum 3D Texture Dimension Size (x,y,z): 16384, 16384, 16384
Maximum Layered 1D Texture Size, (x): 32768 num: 2048
Maximum Layered 2D Texture Size, (x,y): 32768, 32768 num: 2048
Total amount of constant memory bytes: 65536
Total amount of shared memory per block bytes: 49152
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): 1024,1024,64
Max dimension size of a grid size (x,y,z): 2147483647,65535,65535
Maximum memory pitch bytes: 2147483647
Texture alignment bytes: 512
Concurrent copy and kernel execution: Yes, copy engines num: 1
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID: 0, Device PCI Bus ID: 0, Device PCI location ID: 0
Compute Mode: Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
Concurrent kernels: 1
Concurrent memory: 0

[27-10-2021 09:45:36] Releasing Driveworks SDK Context
[27-10-2021 09:45:36] SDK: Release NvMediaDevice
[27-10-2021 09:45:36] SDK: Release NvMedia2D
Happy autonomous driving!
Processing events...
Saving temporary "/tmp/nsys-report-a54f-91de-ec85-a25e.qdstrm" file to disk...

Creating final output files...
Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-a54f-91de-ec85-a25e.qdrep"
Exporting 20389 events: [=================================================100%]

Exported successfully to
/tmp/nsys-report-a54f-91de-ec85-a25e.sqlite


Operating System Runtime API Statistics:

 Time(%)  Total Time (ns)  Num Calls    Average      Minimum     Maximum             Name         
 -------  ---------------  ---------  ------------  ----------  ----------  ----------------------
    75.9       9828574656          8  1228571832.0  1227597664  1229691488  pthread_cond_wait     
    13.4       1731574752         26    66599028.9        6720   522776064  poll                  
     4.9        630196896        602     1046838.7         992   420651872  ioctl                 
     4.0        513571712         11    46688337.5    17980832    91741312  system                
     0.8        103936448         10    10393644.8        3520   100133792  pthread_join          
     0.8         99899616          1    99899616.0    99899616    99899616  sem_timedwait         
     0.2         24564160      18272        1344.4         992     1058176  read                  
     0.0          3984032         36      110667.6        5920     2254208  open                  
     0.0          1868160         11      169832.7      111648      224672  pthread_create        
     0.0          1581728         81       19527.5        4096       94656  mmap                  
     0.0          1554016         97       16020.8        5824       79008  write                 
     0.0          1205920        782        1542.1        1088       90912  sched_yield           
     0.0          1029824         11       93620.4       36032      142016  sem_wait              
     0.0           808576         42       19251.8        4416      126400  fopen                 
     0.0           754688         67       11264.0        1120      109184  fgets                 
     0.0           445728         76        5864.8        2304       65920  putc                  
     0.0           432032          9       48003.6       16384      115264  fopen64               
     0.0           350080         19       18425.3        1152       47552  fflush                
     0.0           313824         21       14944.0        5600       34848  munmap                
     0.0           257504         20       12875.2        4640       66528  fclose                
     0.0           222048         63        3524.6        1120       42816  sigaction             
     0.0           212064          3       70688.0       14208      117120  writev                
     0.0           207296         27        7677.6        1056       91808  fputs                 
     0.0           159584         10       15958.4        1024       52448  pthread_cond_broadcast
     0.0           136704          3       45568.0       27232       77856  connect               
     0.0           125120          6       20853.3       15680       24832  socket                
     0.0           124352          4       31088.0       16288       45952  fread                 
     0.0            86464         28        3088.0        1536        6208  fcntl                 
     0.0            80768          7       11538.3        1152       36480  fwrite                
     0.0            65920          1       65920.0       65920       65920  open64                
     0.0            45344          3       15114.7       12224       18240  pipe2                 
     0.0            26176          2       13088.0        3744       22432  fgets_unlocked        
     0.0            17696          2        8848.0        6400       11296  recvfrom              

Report file moved to "/home/nvidia/report2.qdrep"
Report file moved to "/home/nvidia/report2.sqlite"
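
Since the stats above only show the Operating System Runtime API table, one way to check whether any CUDA trace data made it into the report at all is to look at which tables the exported .sqlite file contains. The following is only a rough sketch: it assumes that CUDA activity is exported into tables whose names start with CUPTI_ (for example CUPTI_ACTIVITY_KIND_KERNEL), which may differ between Nsight Systems versions. The helper names list_tables and cuda_tables are made up for illustration.

```python
import sqlite3
import sys
from contextlib import closing


def list_tables(conn):
    """Return the names of all tables in an open SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def cuda_tables(tables, prefix="CUPTI_"):
    """Filter table names by prefix.

    ASSUMPTION: CUDA trace tables in the nsys export start with 'CUPTI_'.
    Adjust the prefix if your version names them differently.
    """
    return [name for name in tables if name.startswith(prefix)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_report.py /home/nvidia/report2.sqlite
    with closing(sqlite3.connect(sys.argv[1])) as conn:
        found = cuda_tables(list_tables(conn))
    print("CUDA tables:", found if found else "none found")
```

If this prints "none found" for the report from the target but lists tables for the report from the laptop, that would suggest the CUDA trace itself is not being collected on the target, rather than the stats step failing.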

Would you be willing to update to 2021.4?

@hwilper
If you read the beginning of my post, you’ll notice I used 2021.4.1.
Unless, of course, you mean something else, in which case could you be more specific?

You did say 2021.4 at the beginning, but the terminal output you included above says you are running:

nvidia@tegra-ubuntu:~$ nsys --version
NVIDIA Nsight Systems version 2021.1.1.66-6c5c5cb

Sorry for my confusion.
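
With more than one Nsight Systems install on the target, it is easy to end up running a different nsys than intended. A small sketch like the one below reports which nsys binary is first on PATH and what version it claims to be. It assumes the `nsys --version` output has the form shown above ("NVIDIA Nsight Systems version 2021.1.1.66-6c5c5cb"); the function names are hypothetical.

```python
import re
import shutil
import subprocess

# Matches the first three numeric components after the word "version".
VERSION_RE = re.compile(r"version\s+(\d+)\.(\d+)\.(\d+)")


def parse_nsys_version(text):
    """Return (year, major, minor) parsed from `nsys --version` output, or None."""
    match = VERSION_RE.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_nsys():
    """Return (path, version) for the first nsys on PATH, or None if not found."""
    path = shutil.which("nsys")
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return path, parse_nsys_version(result.stdout)


if __name__ == "__main__":
    print(installed_nsys())
```

If the path printed is not the one where the newer build was manually installed, calling that build by its full path (or putting its directory first on PATH) would rule out a version mix-up.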

@Andrey_Trachenko can you help out with the version?

@hwilper
Just to confirm, I have tried both versions 2021.2.1 and 2021.4.1, both with the same issue. And I have been unable to locate the 2019.3 ARM version which was suggested by @SivaRamaKrishnaNV.