I have come to the conclusion that this upgrade is too buggy to actually use. I realize that is a bit harsh and I need to explain. I upgraded my Jetson to JetPack 6.1. After the upgrade, jtop shows the libraries have been upgraded, but it also says that JetPack is not installed. I have been informed that this is a jtop issue and can be corrected by modifying its source code.
I teach a class with this Jetson and use nsight-compute (previous version 2023.2.2, new version 2024.3.1). The new version of Nsight Compute consistently crashed on CUDA examples that it had worked fine with before. I recompiled the examples and discovered that the recompiled code ran fine by itself, but Nsight Compute ran until the very end and then crashed. It offered to send a report to NVIDIA. It did this for several CUDA programs that each run fine on their own. (Perhaps Nsight Compute also changed and now requires some other compilation argument that was different?)
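For context, the workflow in class is roughly the one below (the exact build flags here are illustrative, not necessarily the ones we used):

$ nvcc -O2 -lineinfo -gencode arch=compute_87,code=sm_87 vectorAdd.cu -o vectorAdd
$ ./vectorAdd            # the recompiled program runs fine on its own
$ sudo ncu ./vectorAdd   # under Nsight Compute 2024.3.1 it runs to the very end, then crashes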
Is there a way to roll back the installation to the previous version? Is there a fix for nsight-compute?
I was fortunate that, as an instructor, I have several Jetson Orin Nano systems, and I was able to continue teaching with an unmodified system.
You did not answer my question. Is there a fix for nsight-compute? Your announcement for JetPack 6.1 did not list an update for Nsight Compute, but after the 6.1 update we were running 2024.3.1. If you intend for 2024.3.1 to be part of JetPack 6.1, is there a way to make it work without crashing?
We tested ncu on JetPack 6.1 and it works without issue.
The sample we tested is a simple vectorAdd CUDA kernel.
Which app do you use? Is there any EGL or special library involved, so we can find a similar app to give it a try?
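For reference, the kernel in that sample is essentially the following (a simplified sketch of the cuda-samples vectorAdd kernel, not the exact source):

// Simplified sketch of the vectorAdd kernel used for the test below
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

// The sample launches it with 50000 elements and 256 threads per block,
// i.e. (50000 + 255) / 256 = 196 blocks, matching the grid size in the report below.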
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.3.1.0 (build 34702747) (public-release)
$ sudo /opt/nvidia/nsight-compute/2024.3.1/ncu ./vectorAdd
[Vector addition of 50000 elements]
==PROF== Connected to process 24865 (/home/nvidia/cuda-samples/Samples/0_Introduction/vectorAdd/vectorAdd)
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
==PROF== Profiling "vectorAdd" - 0: 0%....50%....100% - 8 passes
Copy output data from the CUDA device to the host memory
Test PASSED
Done
==PROF== Disconnected from process 24865
[24865] vectorAdd@127.0.0.1
vectorAdd(const float *, const float *, float *, int) (196, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.7
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name Metric Unit Metric Value
----------------------- ----------- ------------
SM Frequency Mhz 303.70
Elapsed Cycles cycle 7280
Memory Throughput % 37.25
Duration us 23.97
L1/TEX Cache Throughput % 13.91
L2 Cache Throughput % 37.25
SM Active Cycles cycle 3787.56
Compute (SM) Throughput % 15.63
----------------------- ----------- ------------
OPT This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 256
Function Cache Configuration CachePreferNone
Grid Size 196
Registers Per Thread register/thread 16
Shared Memory Configuration Size Kbyte 8.19
Driver Shared Memory Per Block Kbyte/block 1.02
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
# SMs SM 16
Threads thread 50,176
Uses Green Context 0
Waves Per SM 2.04
-------------------------------- --------------- ---------------
OPT Est. Speedup: 33.33%
A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the
target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical
occupancy of the kernel. This kernel launch results in 2 full waves and a partial wave of 3 thread blocks.
Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for
up to 33.3% of the total kernel runtime with a lower occupancy of 22.5%. Try launching a grid with no
partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for
a grid. See the Hardware Model
(https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more
details on launch configurations.
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 16
Block Limit Shared Mem block 8
Block Limit Warps block 6
Theoretical Active Warps per SM warp 48
Theoretical Occupancy % 100
Achieved Occupancy % 77.45
Achieved Active Warps Per SM warp 37.18
------------------------------- ----------- ------------
OPT Est. Local Speedup: 22.55%
The difference between calculated theoretical (100.0%) and measured achieved occupancy (77.5%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
Section: GPU and Memory Workload Distribution
-------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------------- ----------- ------------
Average L1 Active Cycles cycle 3787.56
Total L1 Elapsed Cycles cycle 94,552
Average L2 Active Cycles cycle 3397.25
Total L2 Elapsed Cycles cycle 58,216
Average SM Active Cycles cycle 3787.56
Total SM Elapsed Cycles cycle 94,552
Average SMSP Active Cycles cycle 3752.45
Total SMSP Elapsed Cycles cycle 378,208
-------------------------- ----------- ------------
WRN The optional metric dram__cycles_active.avg could not be found. Collecting it as an additional metric could
enable the rule to provide more guidance.
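If it helps when reading the report, the launch and occupancy numbers above are consistent with each other (this uses only values shown in the report):

(50000 + 255) / 256          = 196 blocks        -> Grid Size 196
196 blocks x 256 threads     = 50,176 threads    -> Threads 50,176
196 / (16 SMs x 6 blocks/SM) ~ 2.04              -> Waves Per SM 2.04
37.18 warps / 48 warps       ~ 77.45%            -> Achieved Occupancy 77.45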