How to use Nsight Compute to profile a single CUDA kernel on different processes

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.10.0
DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other

Target Operating System
Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
other

SDK Manager Version
2.1.0
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

Issue Description
Hello,

I am currently trying to profile some custom CUDA kernels on the Orin using Nsight Compute. I started from the command I had been using with Nsight Systems and adapted it for ncu. The following is the full command:

ncu --kernel-name <kernel_name> --launch-skip 0 --launch-count 1 <app>

However, when I launch this, ncu tells me that no kernels were profiled because they were launched in other processes. So I added:

--target-processes all

The problem now is that the profiler doesn't even start. I would greatly appreciate some pointers on this matter!
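For reference, the full invocation after adding the flag looks like this (`<kernel_name>` and `<app>` are placeholders for my actual kernel and application):

```shell
# Combined invocation that fails to start the profiler
# (<kernel_name> and <app> are placeholders):
ncu --kernel-name <kernel_name> \
    --launch-skip 0 --launch-count 1 \
    --target-processes all \
    <app>
```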

Dear @extern.ray.xie ,
Are you running ncu on target?
I quickly tested ncu on target like below.
I copied the /opt/nvidia/nsight-compute/2021.2.10/target/linux-v4l_l4t-t210-a64 and /opt/nvidia/nsight-compute/2021.2.10/sections folders from the DRIVE OS 6.0.10 Docker container to the ~/2021.2.10 folder on the target.
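If it helps, the copy can be scripted roughly as below (the container name, target address, and paths on the host are assumptions; adjust them to your setup):

```shell
# Hypothetical sketch: pull the ncu target binaries and sections out of
# the DRIVE OS Docker container, then push them to the Orin target.
docker cp <container>:/opt/nvidia/nsight-compute/2021.2.10/target ./2021.2.10/target
docker cp <container>:/opt/nvidia/nsight-compute/2021.2.10/sections ./2021.2.10/sections
scp -r ./2021.2.10 nvidia@<target-ip>:~/
```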

nvidia@tegra-ubuntu:~/2021.2.10$ ls
sections  target
nvidia@tegra-ubuntu:~/2021.2.10$ cd target/
nvidia@tegra-ubuntu:~/2021.2.10/target$ ls
linux-v4l_l4t-t210-a64
nvidia@tegra-ubuntu:~/2021.2.10/target$ cd /usr/local/cuda/samples/0_Simple/matrixMul
nvidia@tegra-ubuntu:/usr/local/cuda/samples/0_Simple/matrixMul$ sudo -s
root@tegra-ubuntu:/usr/local/cuda-11.4/samples/0_Simple/matrixMul# /home/nvidia/2021.2.10/target/linux-v4l_l4t-t210-a64/ncu --kernel-name MatrixMulCUDA --target-processes application-only --launch-skip 0 --launch-count 1 matrixMul
[Matrix Multiply Using CUDA] - Starting...
==PROF== Connected to process 204239 (/usr/local/cuda-11.4/samples/0_Simple/matrixMul/matrixMul)
GPU Device 0: "Ampere" with compute capability 8.7

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
==PROF== Profiling "MatrixMulCUDA": 0%....50%....100% - 9 passes
done
Performance= 617.78 GFlop/s, Time= 0.212 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==PROF== Disconnected from process 204239
[204239] matrixMul@127.0.0.1
  void MatrixMulCUDA<(int)32>(float *, float *, float *, int, int), 2025-Apr-09 07:14:57, Context 1, Stream 13
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    SM Frequency                                                             cycle/nsecond                           1.23
    Elapsed Cycles                                                                   cycle                         298988
    Memory [%]                                                                           %                          70.92
    Duration                                                                       usecond                         242.40
    L1/TEX Cache Throughput                                                              %                          74.34
    L2 Cache Throughput                                                                  %                          10.09
    SM Active Cycles                                                                 cycle                      285225.75
    Compute (SM) [%]                                                                     %                          59.27
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis report section to see
          where the memory system bottleneck is. Check memory replay (coalescing) metrics to make sure you're
          efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory
          access (kernel fusion) or whether there are values you can (re)compute.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                       1024
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                         200
    Registers Per Thread                                                   register/thread                             37
    Shared Memory Configuration Size                                                 Kbyte                          16.38
    Driver Shared Memory Per Block                                             Kbyte/block                           1.02
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                             Kbyte/block                           8.19
    Threads                                                                         thread                         204800
    Waves Per SM                                                                                                    12.50
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             16
    Block Limit Registers                                                            block                              1
    Block Limit Shared Mem                                                           block                             18
    Block Limit Warps                                                                block                              1
    Theoretical Active Warps per SM                                                   warp                             32
    Theoretical Occupancy                                                                %                          66.67
    Achieved Occupancy                                                                   %                          66.47
    Achieved Active Warps Per SM                                                      warp                          31.91
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of required registers. This kernel's
          theoretical occupancy (66.7%) is limited by the number of warps within each block.

Hi Siva,

Thank you for responding. I think profiling a kernel launched from the main process works fine with ncu (which is the case in the example you showed). However, I am looking to profile kernels that are launched from other threads spawned by the main process. How can I do that?
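To clarify the pattern I mean, here is a minimal, hypothetical sketch (the real application's kernels and threading are more involved):

```cpp
// Hypothetical sketch: the main process spawns CPU worker threads, and
// each thread launches a CUDA kernel into the process's shared context.
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void workerKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void worker(float *devPtr, int n) {
    // Kernel launched from a worker thread, not from main()
    workerKernel<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaDeviceSynchronize();
}

int main() {
    const int n = 1 << 20;
    float *devPtr = nullptr;
    cudaMalloc(&devPtr, n * sizeof(float));

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(worker, devPtr, n);
    for (auto &t : threads)
        t.join();

    cudaFree(devPtr);
    return 0;
}
```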

In addition, I also don't see this printout when I close my program.

The flag you used is correct. If you notice any errors, could you share the details?