Problem with upgrading JetPack 6.0 to JetPack 6.1

I have come to the conclusion that this upgrade is too buggy to actually use. I realize that is a bit harsh, so let me explain. I upgraded my Jetson to JetPack 6.1. After the upgrade, jtop shows the libraries have been upgraded, but it also says that JetPack is not installed. I have been informed that this is a jtop issue and can be corrected by modifying its source code.

I teach a class with this Jetson and use Nsight Compute (previous version 2023.2.2, new version 2024.3.1). The new version of Nsight Compute consistently crashed on CUDA examples that had worked fine before. I recompiled the examples and discovered that the recompiled code ran fine by itself, but Nsight Compute ran until the very end and then crashed, offering to send a report to NVIDIA. It did this for several CUDA programs that each work fine on their own. (Perhaps Nsight Compute also changed and now requires some compilation argument that is different?)
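To make the failure pattern concrete, here is a sketch of the kind of invocation that crashes (the ncu path is the default JetPack 6.1 install location; vectorAdd stands in for any of the class samples):

```shell
# Sketch of the failing pattern: the app succeeds on its own,
# but profiling the same binary crashes ncu at the very end.
NCU=/opt/nvidia/nsight-compute/2024.3.1/ncu   # default JP6.1 install path
APP=./vectorAdd                               # placeholder for any class sample
echo "${APP}"                # runs to completion on its own
echo "sudo ${NCU} ${APP}"    # all profiling passes finish, then ncu crashes
```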

Is there a way to roll back the installation to the previous version? Is there a fix for nsight-compute?

I was fortunate that, as an instructor, I have several Jetson Orin Nano systems, and I was able to continue teaching with an unmodified system.

Hi nbeser1,

Are you using the devkit or a custom board for the Orin Nano?

How did you upgrade it from JP6.0 to JP6.1?

Could you share the result of cat /etc/nv_tegra_release on your board?
Have you tried to run sudo apt install nvidia-jetpack?

Have you tried to use the flash command to flash the previous release?

I upgraded JP6.0 to JP6.1 using the method from:
https://docs.nvidia.com/jetson/archives/jetpack-archived/jetpack-61/install-setup/index.html#upgrade-jetpack
This is a dev kit.
I issued the following commands:
sudo apt update
sudo apt install nvidia-jetpack

Here is the file you asked about:
$ cat /etc/nv_tegra_release
# R36 (release), REVISION: 4.0, GCID: 37537400, BOARD: generic, EABI: aarch64, DATE: Fri Sep 13 04:36:44 UTC 2024
# KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

It seems you have updated the devkit to JP6.1 (R36.4.0).

Rolling back to the previous release in place is not supported, but you can reflash the board with the previous release for this use case.
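A rough outline of the reflash procedure (the board config name and L4T version here are assumptions based on NVIDIA's usual BSP layout; the official Quick Start guide for the release is the authoritative reference):

```shell
# Sketch of reflashing back to JetPack 6.0 (L4T R36.3) from an x86 Linux host.
# Steps 1-2 are manual; only the final flash command is shown.
set -eu
TARGET=jetson-orin-nano-devkit   # board config name, assumed for the devkit
# 1. Download the R36.3 Driver Package (BSP) and Sample Root Filesystem from
#    the L4T archive, then extract and apply binaries per the Quick Start guide.
# 2. Put the board into Force Recovery mode and connect it to the host over USB.
# 3. From the extracted Linux_for_Tegra directory, run the flash command:
echo "sudo ./flash.sh ${TARGET} internal"
```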

You did not answer my question: is there a fix for Nsight Compute? Your announcement for JetPack 6.1 did not list an update for Nsight Compute, but after upgrading to 6.1 we were running 2024.3.1. If 2024.3.1 is intended to be part of JetPack 6.1, is there a way to make it work without crashing?

Hi,

We tested ncu on JetPack 6.1 and it works without issue.

The sample we tested is a simple vectorAdd CUDA kernel.
Which app do you use? Is any EGL or other special library involved, so we can find a similar app to give it a try?

$ cat /etc/nv_tegra_release 
# R36 (release), REVISION: 4.0, GCID: 37537400, BOARD: generic, EABI: aarch64, DATE: Fri Sep 13 04:36:44 UTC 2024
# KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.3.1.0 (build 34702747) (public-release)
$ sudo /opt/nvidia/nsight-compute/2024.3.1/ncu ./vectorAdd
[Vector addition of 50000 elements]
==PROF== Connected to process 24865 (/home/nvidia/cuda-samples/Samples/0_Introduction/vectorAdd/vectorAdd)
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
==PROF== Profiling "vectorAdd" - 0: 0%....50%....100% - 8 passes
Copy output data from the CUDA device to the host memory
Test PASSED
Done
==PROF== Disconnected from process 24865
[24865] vectorAdd@127.0.0.1
  vectorAdd(const float *, const float *, float *, int) (196, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.7
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    SM Frequency                    Mhz       303.70
    Elapsed Cycles                cycle         7280
    Memory Throughput                 %        37.25
    Duration                         us        23.97
    L1/TEX Cache Throughput           %        13.91
    L2 Cache Throughput               %        37.25
    SM Active Cycles              cycle      3787.56
    Compute (SM) Throughput           %        15.63
    ----------------------- ----------- ------------

    OPT   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance 
          of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate    
          latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.                 

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                   256
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                    196
    Registers Per Thread             register/thread              16
    Shared Memory Configuration Size           Kbyte            8.19
    Driver Shared Memory Per Block       Kbyte/block            1.02
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    # SMs                                         SM              16
    Threads                                   thread          50,176
    Uses Green Context                                             0
    Waves Per SM                                                2.04
    -------------------------------- --------------- ---------------

    OPT   Est. Speedup: 33.33%                                                                                          
          A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the    
          target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical       
          occupancy of the kernel. This kernel launch results in 2 full waves and a partial wave of 3 thread blocks.    
          Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for   
          up to 33.3% of the total kernel runtime with a lower occupancy of 22.5%. Try launching a grid with no         
          partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for  
          a grid. See the Hardware Model                                                                                
          (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more      
          details on launch configurations.                                                                             

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           16
    Block Limit Shared Mem                block            8
    Block Limit Warps                     block            6
    Theoretical Active Warps per SM        warp           48
    Theoretical Occupancy                     %          100
    Achieved Occupancy                        %        77.45
    Achieved Active Warps Per SM           warp        37.18
    ------------------------------- ----------- ------------

    OPT   Est. Local Speedup: 22.55%                                                                                    
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (77.5%) can be the     
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can   
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices   
          Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on     
          optimizing occupancy.                                                                                         

    Section: GPU and Memory Workload Distribution
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    Average L1 Active Cycles         cycle      3787.56
    Total L1 Elapsed Cycles          cycle       94,552
    Average L2 Active Cycles         cycle      3397.25
    Total L2 Elapsed Cycles          cycle       58,216
    Average SM Active Cycles         cycle      3787.56
    Total SM Elapsed Cycles          cycle       94,552
    Average SMSP Active Cycles       cycle      3752.45
    Total SMSP Elapsed Cycles        cycle      378,208
    -------------------------- ----------- ------------

    WRN   The optional metric dram__cycles_active.avg could not be found. Collecting it as an additional metric could   
          enable the rule to provide more guidance. 

Thanks.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.