We can find the ncu binary on the Orin.
Please check whether it is there on your side.
$ cd /opt/nvidia/nsight-compute/2022.2.1
$ sudo ./ncu /usr/local/cuda-11.4/samples/0_Simple/vectorAdd/vectorAdd
[Vector addition of 50000 elements]
==PROF== Connected to process 21604 (/usr/local/cuda-11.4/samples/0_Simple/vectorAdd/vectorAdd)
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
==PROF== Profiling "vectorAdd" - 0: 0%....50%....100% - 9 passes
Copy output data from the CUDA device to the host memory
Test PASSED
Done
==PROF== Disconnected from process 21604
[21604] vectorAdd@127.0.0.1
vectorAdd(const float *, const float *, float *, int), 2022-Apr-24 21:20:32, Context 1, Stream 7
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
SM Frequency                                                             cycle/usecond                         479.98
Elapsed Cycles                                                                   cycle                           8463
Memory [%]                                                                           %                          31.71
Duration                                                                       usecond                          17.63
...
This is weird. My /opt/nvidia/ has Nsight Systems and Nsight Graphics but not Nsight Compute.
Anyway, I found a workaround. During remote profiling, the host machine deploys the Nsight Compute binaries under /tmp/var/ (at least for me), and the folder is not deleted afterwards. The binary is /tmp/var/target/linux-v4l_l4t-t210-a64/ncu. So I just copied the entire folder somewhere else, and now I can profile locally.
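For anyone who wants to repeat the copy step, here is a minimal sketch. The source path is the one reported above and may differ on other setups. The destination directory and the `save_ncu` helper name are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch: keep the ncu files left behind by a remote profiling session.
# SRC is the path reported in this thread; it may differ on your device.
# DEST is a hypothetical location, pick whatever suits you.
SRC="${SRC:-/tmp/var/target/linux-v4l_l4t-t210-a64}"
DEST="${DEST:-$HOME/nsight-compute-target}"

save_ncu() {
    # Stop early if the deployed binary is not where we expect it.
    if [ ! -x "$SRC/ncu" ]; then
        echo "ncu not found under $SRC; run a remote profiling session first" >&2
        return 1
    fi
    # Copy the whole folder, not just the binary, preserving permissions.
    mkdir -p "$DEST" && cp -a "$SRC/." "$DEST/"
}

save_ncu && echo "ncu copied to $DEST/ncu"
```

After that, profiling locally should look like the earlier example, with the copied binary in place of the /opt/nvidia one (for instance `sudo "$DEST/ncu"` followed by the path to your application).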