How to use Nsight Compute to profile a single CUDA kernel on different processes

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.10.0
DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other

Target Operating System
Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
other

SDK Manager Version
2.1.0
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

Issue Description
Hello,

I am currently trying to profile some custom CUDA kernels on the Orin using Nsight Compute. I started from the command I had been using with Nsight Systems and adapted it for ncu. The following is the full command:

ncu --kernel-name <kernel_name> --launch-skip 0 --launch-count 1 <app>

However, when I launch this, ncu tells me that no kernels were profiled because they were launched in other processes. So I added:

--target-processes all

The problem now is that the profiler doesn't even start. I would greatly appreciate some pointers on this matter!
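For reference, the full invocation after adding the flag looks like this (`<kernel_name>` and `<app>` are placeholders for my actual kernel and application):

```shell
# Combined invocation that fails to start the profiler
# (<kernel_name> and <app> are placeholders):
ncu --kernel-name <kernel_name> \
    --launch-skip 0 --launch-count 1 \
    --target-processes all \
    <app>
```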

Dear @extern.ray.xie ,
Are you running ncu on target?
I quickly tested ncu on target like below.
I copied the /opt/nvidia/nsight-compute/2021.2.10/target/linux-v4l_l4t-t210-a64 and /opt/nvidia/nsight-compute/2021.2.10/sections folders from the DRIVE OS 6.0.10 Docker container to the ~/2021.2.10 folder on the target.
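If it helps, the copy can be scripted roughly as below (the container name, target address, and paths on the host are assumptions; adjust them to your setup):

```shell
# Hypothetical sketch: pull the ncu target binaries and sections out of
# the DRIVE OS Docker container, then push them to the Orin target.
docker cp <container>:/opt/nvidia/nsight-compute/2021.2.10/target ./2021.2.10/target
docker cp <container>:/opt/nvidia/nsight-compute/2021.2.10/sections ./2021.2.10/sections
scp -r ./2021.2.10 nvidia@<target-ip>:~/
```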

nvidia@tegra-ubuntu:~/2021.2.10$ ls
sections  target
nvidia@tegra-ubuntu:~/2021.2.10$ cd target/
nvidia@tegra-ubuntu:~/2021.2.10/target$ ls
linux-v4l_l4t-t210-a64
nvidia@tegra-ubuntu:~/2021.2.10/target$ cd /usr/local/cuda/samples/0_Simple/matrixMul
nvidia@tegra-ubuntu:/usr/local/cuda/samples/0_Simple/matrixMul$ sudo -s
root@tegra-ubuntu:/usr/local/cuda-11.4/samples/0_Simple/matrixMul# /home/nvidia/2021.2.10/target/linux-v4l_l4t-t210-a64/ncu --kernel-name MatrixMulCUDA --target-processes application-only --launch-skip 0 --launch-count 1 matrixMul
[Matrix Multiply Using CUDA] - Starting...
==PROF== Connected to process 204239 (/usr/local/cuda-11.4/samples/0_Simple/matrixMul/matrixMul)
GPU Device 0: "Ampere" with compute capability 8.7

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
==PROF== Profiling "MatrixMulCUDA": 0%....50%....100% - 9 passes
done
Performance= 617.78 GFlop/s, Time= 0.212 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==PROF== Disconnected from process 204239
[204239] matrixMul@127.0.0.1
  void MatrixMulCUDA<(int)32>(float *, float *, float *, int, int), 2025-Apr-09 07:14:57, Context 1, Stream 13
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    SM Frequency                                                             cycle/nsecond                           1.23
    Elapsed Cycles                                                                   cycle                         298988
    Memory [%]                                                                           %                          70.92
    Duration                                                                       usecond                         242.40
    L1/TEX Cache Throughput                                                              %                          74.34
    L2 Cache Throughput                                                                  %                          10.09
    SM Active Cycles                                                                 cycle                      285225.75
    Compute (SM) [%]                                                                     %                          59.27
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis report section to see
          where the memory system bottleneck is. Check memory replay (coalescing) metrics to make sure you're
          efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory
          access (kernel fusion) or whether there are values you can (re)compute.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                       1024
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                         200
    Registers Per Thread                                                   register/thread                             37
    Shared Memory Configuration Size                                                 Kbyte                          16.38
    Driver Shared Memory Per Block                                             Kbyte/block                           1.02
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                             Kbyte/block                           8.19
    Threads                                                                         thread                         204800
    Waves Per SM                                                                                                    12.50
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             16
    Block Limit Registers                                                            block                              1
    Block Limit Shared Mem                                                           block                             18
    Block Limit Warps                                                                block                              1
    Theoretical Active Warps per SM                                                   warp                             32
    Theoretical Occupancy                                                                %                          66.67
    Achieved Occupancy                                                                   %                          66.47
    Achieved Active Warps Per SM                                                      warp                          31.91
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (66.7%) is limited by the number of required registers. This kernel's
          theoretical occupancy (66.7%) is limited by the number of warps within each block.

Hi Siva,

Thank you for responding. I think profiling a kernel launched from the main process works fine with ncu (which is the case in the example you showed). However, I am looking to profile kernels that are launched from other threads spawned by the main process. How can I do that?
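To clarify the pattern I mean, here is a minimal, hypothetical sketch (the real application's kernels and threading are more involved):

```cpp
// Hypothetical sketch: the main process spawns CPU worker threads, and
// each thread launches a CUDA kernel into the process's shared context.
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void workerKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void worker(float *devPtr, int n) {
    // Kernel launched from a worker thread, not from main()
    workerKernel<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaDeviceSynchronize();
}

int main() {
    const int n = 1 << 20;
    float *devPtr = nullptr;
    cudaMalloc(&devPtr, n * sizeof(float));

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(worker, devPtr, n);
    for (auto &t : threads)
        t.join();

    cudaFree(devPtr);
    return 0;
}
```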

In addition, I also don't see this printout when I close my program.

The flag you used is correct. If you notice any errors, could you share the details?