Some metric sets and sections are not enabled

When I use ncu to collect data on Linux with the command ncu -o report --set full and then download the report to Windows, it seems no data was collected.
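For reference, this is roughly what I did (a.out is just my test app):

    # on the Linux machine
    ncu -o report --set full ./a.out    # writes report.ncu-rep
    # then copy report.ncu-rep to the Windows machine and open it
    # in the Nsight Compute GUI there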


When I use the command ncu --list-sections, it shows most sections as not enabled.

So how can I fix it?

ncu version:

    NVIDIA (R) Nsight Compute Command Line Profiler
    Copyright (c) 2018-2023 NVIDIA Corporation
    Version 2023.1.1.0 (build 32678585) (public-release)

The “Enabled” column in the --list-sets output does not imply that a section is not working properly. It shows whether the section is enabled, given the sections/sets selection in your current command. If you run ncu --list-sections, you are not specifying any sections or sets (groups of sections) explicitly, so only the default set (basic) and its associated sections are shown as enabled. If you were to run e.g. ncu --set full --list-sets, you would see that the full set is enabled, and so forth.
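For example:

    ncu --list-sets                  # only the default set (basic) is shown as enabled
    ncu --set full --list-sets       # now the full set is shown as enabled
    ncu --set full --list-sections   # the sections belonging to the full set are enabled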

The fact that data collection doesn’t work for you when enabling the full set is a different problem. To help us help you, please provide at least the following information (the command sketch after the list covers most of it):

  • Which driver is installed on the system?
  • Which GPU is installed? Provide e.g. the output of nvidia-smi.
  • What is the command line output during collection, and which errors are shown?
  • Does data collection work if you profile directly on the Linux target system, e.g. ncu --set full <app>?
  • Does data collection work if you select a different set (e.g. “basic”) or profile a different test app on that system?
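A quick way to gather most of this in one go (adjust the binary name to your app):

    nvidia-smi               # GPU model and driver version
    ncu --version            # profiler version
    ncu --set full ./a.out   # profile directly on the Linux target
    ncu --set basic ./a.out  # compare with the default set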

Hi, thanks for the information.
Is it possible to go further without providing the driver and GPU information?
I tried the following command: ncu ./a.out
It shows:

==PROF== Connected to process 9158 
==PROF== Profiling "stencil_1d(int *, int *)" - 0: 0%....50%....100% - 2 passes
==PROF== Disconnected from process 9158
[9158] a.out@127.0.0.1
  stencil_1d(int *, int *) (64, 1, 1)x(16, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------ ------------
    Metric Name              Metric Unit Metric Value
    ----------------------- ------------ ------------
    DRAM Frequency          cycle/second      (!) nan
    SM Frequency            cycle/second      (!) nan
    Elapsed Cycles                 cycle      (!) nan
    Memory Throughput                  %      (!) nan
    DRAM Throughput                    %      (!) nan
    Duration                     usecond         4.03
    L1/TEX Cache Throughput            %      (!) nan
    L2 Cache Throughput                %      (!) nan
    SM Active Cycles               cycle      (!) nan
    Compute (SM) Throughput            %      (!) nan
    ----------------------- ------------ ------------

    INF   The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
          further improve performance, work will likely need to be shifted from the most utilized to another unit.
          Start by analyzing workloads in the Memory Workload Analysis section.

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                    16
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                     64
    Registers Per Thread             register/thread              18
    Shared Memory Configuration Size           Kbyte           65.54
    Driver Shared Memory Per Block       Kbyte/block            1.02
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block              88
    Threads                                   thread           1,024
    Waves Per SM                                                0.02
    -------------------------------- --------------- ---------------

    WRN   Threads are executed in groups of 32 threads called warps. This kernel launch is configured to execute 16
          threads per block. Consequently, some threads in a warp are masked off and those hardware resources are
          unused. Try changing the number of threads per block to be a multiple of 32 threads. Between 128 and 256
          threads per block is a good initial range for experimentation. Use smaller thread blocks rather than one
          large thread block per multiprocessor if latency affects performance.  This is particularly beneficial to
          kernels that frequently call __syncthreads(). See the Hardware Model
          (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more
          details on launch configurations.
    ----- --------------------------------------------------------------------------------------------------------------
    WRN   The grid for this launch is configured to execute only 64 blocks, which is less than the GPU's 108
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
          concurrently with other workloads, consider reducing the block size to have at least one block per
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the
          Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
          description for more details on launch configurations.

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           32
    Block Limit Registers                 block           84
    Block Limit Shared Mem                block           56
    Block Limit Warps                     block           64
    Theoretical Active Warps per SM        warp           32
    Theoretical Occupancy                     %           50
    Achieved Occupancy                        %      (!) nan
    Achieved Active Warps Per SM           warp      (!) nan
    ------------------------------- ----------- ------------

    WRN   This kernel's theoretical occupancy (50.0%) is limited by the number of blocks that can fit on the SM. See
          the CUDA Best Practices Guide
          (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.
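As an aside, the launch-configuration warnings above amount to a change along these lines. This is a sketch only: the real stencil_1d body is not shown in this thread, so it is left empty here.

    #include <cuda_runtime.h>

    // Sketch only: the real stencil_1d body is not shown in this thread.
    __global__ void stencil_1d(int *in, int *out) { /* ... */ }

    int main() {
        const int N = 1024;  // matches the original (64,1,1)x(16,1,1) launch
        int *in, *out;
        cudaMalloc(&in,  N * sizeof(int));
        cudaMalloc(&out, N * sizeof(int));

        // Original launch: 16 threads/block leaves half of each 32-thread warp masked off.
        //   stencil_1d<<<64, 16>>>(in, out);

        // Multiple-of-32 block size, as the first WRN suggests; note that with
        // only 1024 threads the grid is still far below the GPU's 108 SMs, so
        // the second warning only goes away with a larger problem size.
        const int threads = 128;
        stencil_1d<<<(N + threads - 1) / threads, threads>>>(in, out);
        cudaDeviceSynchronize();

        cudaFree(in);
        cudaFree(out);
        return 0;
    }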