Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.10.0
DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other
Target Operating System
Linux
QNX
other
Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure of its number)
other
SDK Manager Version
2.1.0
other
Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other
Issue Description
Hello,
I am currently trying to profile some custom CUDA kernels on the Orin using Nsight Compute. I copied the command over from Nsight Systems and tried to profile with it. This is the full command:
ncu --kernel-name <kernel_name> --launch-skip 0 --launch-count 1 <app>
However, when I launch this, ncu told me that no kernels were profiled because they were spawned by other processes. So I added
--target-processes all
The problem now is that the profiler does not even start. I would greatly appreciate some pointers on this matter!
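For reference, the full command after adding the flag looks like this (<kernel_name> and <app> are placeholders for my kernel and application, as above):
ncu --kernel-name <kernel_name> --target-processes all --launch-skip 0 --launch-count 1 <app>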
Dear @extern.ray.xie,
Are you running ncu on the target?
I quickly tested ncu on the target as shown below. I copied the /opt/nvidia/nsight-compute/2021.2.10/target/linux-v4l_l4t-t210-a64 and /opt/nvidia/nsight-compute/2021.2.10/sections folders from the DRIVE OS 6.0.10 Docker container to ~/2021.2.10 on the target.
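If it helps, the copy can be done roughly as below; the container name drive_os_6.0.10 and <target-ip> are placeholders, so adjust them for your setup:
# On the host: copy the ncu target binaries and sections out of the DRIVE OS container
mkdir -p 2021.2.10/target
docker cp drive_os_6.0.10:/opt/nvidia/nsight-compute/2021.2.10/target/linux-v4l_l4t-t210-a64 2021.2.10/target/
docker cp drive_os_6.0.10:/opt/nvidia/nsight-compute/2021.2.10/sections 2021.2.10/
# Push the folder to the Orin target
scp -r 2021.2.10 nvidia@<target-ip>:~/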
nvidia@tegra-ubuntu:~/2021.2.10$ ls
sections target
nvidia@tegra-ubuntu:~/2021.2.10$ cd target/
nvidia@tegra-ubuntu:~/2021.2.10/target$ ls
linux-v4l_l4t-t210-a64
nvidia@tegra-ubuntu:~/2021.2.10/target$ cd /usr/local/cuda/samples/0_Simple/matrixMul
nvidia@tegra-ubuntu:/usr/local/cuda/samples/0_Simple/matrixMul$ sudo -s
root@tegra-ubuntu:/usr/local/cuda-11.4/samples/0_Simple/matrixMul# /home/nvidia/2021.2.10/target/linux-v4l_l4t-t210-a64/ncu --kernel-name MatrixMulCUDA --target-processes application-only --launch-skip 0 --launch-count 1 matrixMul
[Matrix Multiply Using CUDA] - Starting...
==PROF== Connected to process 204239 (/usr/local/cuda-11.4/samples/0_Simple/matrixMul/matrixMul)
GPU Device 0: "Ampere" with compute capability 8.7
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
==PROF== Profiling "MatrixMulCUDA": 0%....50%....100% - 9 passes
done
Performance= 617.78 GFlop/s, Time= 0.212 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==PROF== Disconnected from process 204239
[204239] matrixMul@127.0.0.1
void MatrixMulCUDA<(int)32>(float *, float *, float *, int, int), 2025-Apr-09 07:14:57, Context 1, Stream 13
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
SM Frequency cycle/nsecond 1.23
Elapsed Cycles cycle 298988
Memory [%] % 70.92
Duration usecond 242.40
L1/TEX Cache Throughput % 74.34
L2 Cache Throughput % 10.09
SM Active Cycles cycle 285225.75
Compute (SM) [%] % 59.27
---------------------------------------------------------------------- --------------- ------------------------------
WRN Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis report section to see
where the memory system bottleneck is. Check memory replay (coalescing) metrics to make sure you're
efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory
access (kernel fusion) or whether there are values you can (re)compute.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 200
Registers Per Thread register/thread 37
Shared Memory Configuration Size Kbyte 16.38
Driver Shared Memory Per Block Kbyte/block 1.02
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 8.19
Threads thread 204800
Waves Per SM 12.50
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 16
Block Limit Registers block 1
Block Limit Shared Mem block 18
Block Limit Warps block 1
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 66.67
Achieved Occupancy % 66.47
Achieved Active Warps Per SM warp 31.91
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy (66.7%) is limited by the number of required registers. This kernel's
theoretical occupancy (66.7%) is limited by the number of warps within each block.
Hi Siva,
Thank you for responding. Profiling a kernel launched from the main process works fine with ncu (which is the case in your example). However, I am looking to profile kernels that are launched on threads spawned by the main process; a rough sketch of my setup is below. How can I do that?
In addition, I also don't see this printout when I close my program.
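For concreteness, here is the sketch (illustrative only, not my actual code): the main process spawns a worker thread, and the worker launches the kernel.
// thread_launch.cu - illustrative sketch; build with: nvcc -o thread_launch thread_launch.cu
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void worker() {
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    // Kernel launched from a worker thread, not from main()
    scaleKernel<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
}

int main() {
    std::thread t(worker); // same process, separate host thread
    t.join();
    printf("done\n");
    return 0;
}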
The flag you used is correct. If you notice any errors, could you share the details?
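Note that --target-processes all is only needed when kernels are launched from child processes; kernels launched from additional CPU threads within the same process are profiled even with --target-processes application-only. As a quick check on target, a command along these lines (paths per my earlier test, with <kernel_name> and <app> being your names) should start the profiler:
sudo /home/nvidia/2021.2.10/target/linux-v4l_l4t-t210-a64/ncu --kernel-name <kernel_name> --target-processes all --launch-skip 0 --launch-count 1 <app>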