Using Nsight Compute to Inspect your Kernels

Originally published at: https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/

By now, hopefully you read the first two blogs in this series “Migrating to NVIDIA Nsight Tools from NVVP and Nvprof” and “Transitioning to Nsight Systems from NVIDIA Visual Profiler / nvprof,” and you’ve discovered NVIDIA added a few new tools, both Nsight Compute and Nsight Systems, to the repertoire of CUDA tools available for…

How can I use Nsight Compute with Python scripts?

For general questions about using the tools, the blog article suggests asking on the tools forum. For Nsight Compute, that is here: https://forums.developer.nvidia.com/c/development-tools/nsight-compute/114

Generally speaking, nothing special should be required to use Nsight Compute with Python scripts. Using the sample Python/CuPy code here: https://stackoverflow.com/a/61567110/1695960, if I run:

ncu python t1.py

I get output like this:

$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2012-2020 NVIDIA Corporation
Version 2020.1.0 (Build 28294165)
$ ncu python t1.py
==PROF== Connected to process 57559 (/home/nvidia/anaconda3/bin/python3.7)
==PROF== Profiling "generate_seed_pseudo" - 1: 0%....50%....100% - 19 passes
==PROF== Profiling "gen_sequenced" - 2: 0%....50%....100% - 19 passes
==PROF== Profiling "cupy_multiply" - 3: 0%....50%....100% - 19 passes
==PROF== Profiling "cupy_add" - 4: 0%....50%....100% - 19 passes
==PROF== Profiling "clamp_generic" - 5: 0%....50%....100% - 19 passes
==PROF== Disconnected from process 57559
[57559] python3.7@127.0.0.1
  generate_seed_pseudo(unsigned long long, unsigned long long, unsigned long long, curandOrdering, curandStateXORWOW*, unsigned int*), 2020-Aug-31 14:38:59, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         867.55
    SM Frequency                                                             cycle/nsecond                           1.28
    Elapsed Cycles                                                                   cycle                      1,427,785
    Memory [%]                                                                           %                          34.91
    SOL DRAM                                                                             %                           0.04
    Duration                                                                       msecond                           1.11
    SOL L1/TEX Cache                                                                     %                          53.14
    SOL L2 Cache                                                                         %                          34.91
    SM Active Cycles                                                                 cycle                     797,969.30
    SM [%]                                                                               %                           5.06
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
          more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                         64
    Grid Size                                                                                                          64
    Registers Per Thread                                                   register/thread                             48
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                          4,096
    Waves Per SM                                                                                                     0.04
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 64 blocks, which is less than the GPU's 80
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
          concurrently with other workloads, consider reducing the block size to have at least one block per
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                             20
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                             32
    Theoretical Active Warps per SM                                             warp/cycle                             40
    Theoretical Occupancy                                                                %                          62.50
    Achieved Occupancy                                                                   %                           3.01
    Achieved Active Warps Per SM                                                      warp                           1.93
    ---------------------------------------------------------------------- --------------- ------------------------------

  void gen_sequenced<curandStateXORWOW, double2, normal_args_double_st, &(double2 curand_normal_scaled2_double<curandStateXORWOW>(curandStateXORWOW*, normal_args_double_st))>(curandStateXORWOW*, double2*, unsigned long, unsigned long, normal_args_double_st), 2020-Aug-31 14:38:59, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         597.09
    SM Frequency                                                             cycle/usecond                         880.71
    Elapsed Cycles                                                                   cycle                          5,812
    Memory [%]                                                                           %                          11.79
    SOL DRAM                                                                             %                           4.88
    Duration                                                                       usecond                           6.59
    SOL L1/TEX Cache                                                                     %                          17.67
    SOL L2 Cache                                                                         %                          11.79
    SM Active Cycles                                                                 cycle                       3,159.89
    SM [%]                                                                               %                           3.61
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
          more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                         64
    Grid Size                                                                                                          64
    Registers Per Thread                                                   register/thread                             36
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                          4,096
    Waves Per SM                                                                                                     0.03
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 64 blocks, which is less than the GPU's 80
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
          concurrently with other workloads, consider reducing the block size to have at least one block per
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                             24
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                             32
    Theoretical Active Warps per SM                                             warp/cycle                             48
    Theoretical Occupancy                                                                %                             75
    Achieved Occupancy                                                                   %                           3.07
    Achieved Active Warps Per SM                                                      warp                           1.96
    ---------------------------------------------------------------------- --------------- ------------------------------

  cupy_multiply, 2020-Aug-31 14:39:00, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         669.64
    SM Frequency                                                             cycle/usecond                         990.33
    Elapsed Cycles                                                                   cycle                          3,552
    Memory [%]                                                                           %                           5.21
    SOL DRAM                                                                             %                           5.21
    Duration                                                                       usecond                           3.58
    SOL L1/TEX Cache                                                                     %                           4.15
    SOL L2 Cache                                                                         %                           3.76
    SM Active Cycles                                                                 cycle                       1,555.72
    SM [%]                                                                               %                           1.83
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
          more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        128
    Grid Size                                                                                                         125
    Registers Per Thread                                                   register/thread                             18
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                         16,000
    Waves Per SM                                                                                                     0.10
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the
          achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the
          hardware busy.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                             21
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                             16
    Theoretical Active Warps per SM                                             warp/cycle                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                           8.97
    Achieved Active Warps Per SM                                                      warp                           5.74
    ---------------------------------------------------------------------- --------------- ------------------------------

  cupy_add, 2020-Aug-31 14:39:00, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         650.49
    SM Frequency                                                             cycle/usecond                         961.01
    Elapsed Cycles                                                                   cycle                          3,170
    Memory [%]                                                                           %                           5.83
    SOL DRAM                                                                             %                           5.83
    Duration                                                                       usecond                           3.30
    SOL L1/TEX Cache                                                                     %                           4.65
    SOL L2 Cache                                                                         %                           4.21
    SM Active Cycles                                                                 cycle                       1,387.59
    SM [%]                                                                               %                           2.05
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
          more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        128
    Grid Size                                                                                                         125
    Registers Per Thread                                                   register/thread                             18
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                         16,000
    Waves Per SM                                                                                                     0.10
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the
          achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the
          hardware busy.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                             21
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                             16
    Theoretical Active Warps per SM                                             warp/cycle                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                           9.00
    Achieved Active Warps Per SM                                                      warp                           5.76
    ---------------------------------------------------------------------- --------------- ------------------------------

  clamp_generic, 2020-Aug-31 14:39:00, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         650.49
    SM Frequency                                                             cycle/usecond                         959.80
    Elapsed Cycles                                                                   cycle                          3,166
    Memory [%]                                                                           %                           5.83
    SOL DRAM                                                                             %                           5.83
    Duration                                                                       usecond                           3.30
    SOL L1/TEX Cache                                                                     %                           4.59
    SOL L2 Cache                                                                         %                           4.21
    SM Active Cycles                                                                 cycle                       1,404.49
    SM [%]                                                                               %                           1.83
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
          more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        128
    Grid Size                                                                                                         125
    Registers Per Thread                                                   register/thread                             16
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                         16,000
    Waves Per SM                                                                                                     0.10
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the
          achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the
          hardware busy.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                             32
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                             16
    Theoretical Active Warps per SM                                             warp/cycle                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                           9.00
    Achieved Active Warps Per SM                                                      warp                           5.76
    ---------------------------------------------------------------------- --------------- ------------------------------

$
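
As a rough cross-check of the Occupancy and Launch Statistics sections above, the reported values can be reproduced with a little arithmetic. This is only a sketch under assumptions, not Nsight Compute's official definitions: the maximum of 64 warps per SM is inferred from the kernels that report 64 active warps at 100% occupancy, the 80 SMs come from the warning text, and the helper names (`theoretical_occupancy`, `waves_per_sm`) are made up for illustration.

```python
import math

WARP_SIZE = 32         # threads per warp on NVIDIA GPUs
MAX_WARPS_PER_SM = 64  # assumption: inferred from "64 warps == 100%" in the report
NUM_SMS = 80           # from the "GPU's 80 multiprocessors" warning


def theoretical_occupancy(block_size, block_limits):
    """Return (active warps per SM, occupancy %) from the reported block limits."""
    blocks_per_sm = min(block_limits)  # the tightest limit wins
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    active_warps = min(blocks_per_sm * warps_per_block, MAX_WARPS_PER_SM)
    return active_warps, 100.0 * active_warps / MAX_WARPS_PER_SM


def waves_per_sm(grid_size, block_limits):
    """Grid size divided by the number of blocks the whole GPU can hold at once."""
    return grid_size / (NUM_SMS * min(block_limits))


# Block size, grid size, and block limits (SM, registers, shared mem, warps)
# copied from the report above.
kernels = {
    "generate_seed_pseudo": dict(block=64, grid=64, limits=(32, 20, 32, 32)),
    "gen_sequenced": dict(block=64, grid=64, limits=(32, 24, 32, 32)),
    "cupy_multiply": dict(block=128, grid=125, limits=(32, 21, 32, 16)),
}

for name, k in kernels.items():
    warps, occ = theoretical_occupancy(k["block"], k["limits"])
    waves = waves_per_sm(k["grid"], k["limits"])
    print(f"{name}: {warps} warps, {occ:.2f}% occupancy, {waves:.2f} waves per SM")
```

For these three kernels the computed values agree with the report. In each case the achieved occupancy is far below the theoretical figure, which is consistent with the "grid is too small" warnings: with at most 125 blocks spread over 80 SMs, most SMs hold only one or two blocks at a time.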

I probably won’t be able to respond to further or more detailed questions here. Please ask Nsight Compute usage questions on the forum linked above.