Using Nsight Compute to Inspect your Kernels

jwitsoe · August 25, 2020, 11:49pm

Originally published at: https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/

By now, hopefully you read the first two blogs in this series “Migrating to NVIDIA Nsight Tools from NVVP and Nvprof” and “Transitioning to Nsight Systems from NVIDIA Visual Profiler / nvprof,” and you’ve discovered NVIDIA added a few new tools, both Nsight Compute and Nsight Systems, to the repertoire of CUDA tools available for…

13ecd1 · August 31, 2020, 6:30pm

How can I use Nsight Compute with Python scripts?

Robert_Crovella · August 31, 2020, 9:46pm

For general questions about using the tools, the blog article suggests asking on the tools forum. For Nsight Compute, that is here: Nsight Compute - NVIDIA Developer Forums

Generally speaking, nothing special should be required to use Nsight Compute with Python scripts. Taking the sample Python/CuPy code from this Stack Overflow question: "python 3.x - cupy indexing is slow", if I run:

ncu python t1.py

I get output like this:

$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2012-2020 NVIDIA Corporation
Version 2020.1.0 (Build 28294165)
$ ncu python t1.py
==PROF== Connected to process 57559 (/home/nvidia/anaconda3/bin/python3.7)
==PROF== Profiling "generate_seed_pseudo" - 1: 0%....50%....100% - 19 passes
==PROF== Profiling "gen_sequenced" - 2: 0%....50%....100% - 19 passes
==PROF== Profiling "cupy_multiply" - 3: 0%....50%....100% - 19 passes
==PROF== Profiling "cupy_add" - 4: 0%....50%....100% - 19 passes
==PROF== Profiling "clamp_generic" - 5: 0%....50%....100% - 19 passes
==PROF== Disconnected from process 57559
[57559] python3.7@127.0.0.1
  generate_seed_pseudo(unsigned long long, unsigned long long, unsigned long long, curandOrdering, curandStateXORWOW*, unsigned int*), 2020-Aug-31 14:38:59, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         867.55
    SM Frequency                                                             cycle/nsecond                           1.28
    Elapsed Cycles                                                                   cycle                      1,427,785
    Memory [%]                                                                           %                          34.91
    SOL DRAM                                                                             %                           0.04
    Duration                                                                       msecond                           1.11
    SOL L1/TEX Cache                                                                     %                          53.14
    SOL L2 Cache                                                                         %                          34.91
    SM Active Cycles                                                                 cycle                     797,969.30
    SM [%]                                                                               %                           5.06
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
          more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                         64
    Grid Size                                                                                                          64
    Registers Per Thread                                                   register/thread                             48
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                          4,096
    Waves Per SM                                                                                                     0.04
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 64 blocks, which is less than the GPU's 80
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
          concurrently with other workloads, consider reducing the block size to have at least one block per
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                             20
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                             32
    Theoretical Active Warps per SM                                             warp/cycle                             40
    Theoretical Occupancy                                                                %                          62.50
    Achieved Occupancy                                                                   %                           3.01
    Achieved Active Warps Per SM                                                      warp                           1.93
    ---------------------------------------------------------------------- --------------- ------------------------------

  void gen_sequenced<curandStateXORWOW, double2, normal_args_double_st, &(double2 curand_normal_scaled2_double<curandStateXORWOW>(curandStateXORWOW*, normal_args_double_st))>(curandStateXORWOW*, double2*, unsigned long, unsigned long, normal_args_double_st), 2020-Aug-31 14:38:59, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         597.09
    SM Frequency                                                             cycle/usecond                         880.71
    Elapsed Cycles                                                                   cycle                          5,812
    Memory [%]                                                                           %                          11.79
    SOL DRAM                                                                             %                           4.88
    Duration                                                                       usecond                           6.59
    SOL L1/TEX Cache                                                                     %                          17.67
    SOL L2 Cache                                                                         %                          11.79
    SM Active Cycles                                                                 cycle                       3,159.89
    SM [%]                                                                               %                           3.61
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
          more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                         64
    Grid Size                                                                                                          64
    Registers Per Thread                                                   register/thread                             36
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                          4,096
    Waves Per SM                                                                                                     0.03
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   The grid for this launch is configured to execute only 64 blocks, which is less than the GPU's 80
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
          concurrently with other workloads, consider reducing the block size to have at least one block per
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                             24
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                             32
    Theoretical Active Warps per SM                                             warp/cycle                             48
    Theoretical Occupancy                                                                %                             75
    Achieved Occupancy                                                                   %                           3.07
    Achieved Active Warps Per SM                                                      warp                           1.96
    ---------------------------------------------------------------------- --------------- ------------------------------

  cupy_multiply, 2020-Aug-31 14:39:00, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         669.64
    SM Frequency                                                             cycle/usecond                         990.33
    Elapsed Cycles                                                                   cycle                          3,552
    Memory [%]                                                                           %                           5.21
    SOL DRAM                                                                             %                           5.21
    Duration                                                                       usecond                           3.58
    SOL L1/TEX Cache                                                                     %                           4.15
    SOL L2 Cache                                                                         %                           3.76
    SM Active Cycles                                                                 cycle                       1,555.72
    SM [%]                                                                               %                           1.83
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
          more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        128
    Grid Size                                                                                                         125
    Registers Per Thread                                                   register/thread                             18
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                         16,000
    Waves Per SM                                                                                                     0.10
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the
          achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the
          hardware busy.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                             21
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                             16
    Theoretical Active Warps per SM                                             warp/cycle                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                           8.97
    Achieved Active Warps Per SM                                                      warp                           5.74
    ---------------------------------------------------------------------- --------------- ------------------------------

  cupy_add, 2020-Aug-31 14:39:00, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         650.49
    SM Frequency                                                             cycle/usecond                         961.01
    Elapsed Cycles                                                                   cycle                          3,170
    Memory [%]                                                                           %                           5.83
    SOL DRAM                                                                             %                           5.83
    Duration                                                                       usecond                           3.30
    SOL L1/TEX Cache                                                                     %                           4.65
    SOL L2 Cache                                                                         %                           4.21
    SM Active Cycles                                                                 cycle                       1,387.59
    SM [%]                                                                               %                           2.05
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
          more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        128
    Grid Size                                                                                                         125
    Registers Per Thread                                                   register/thread                             18
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                         16,000
    Waves Per SM                                                                                                     0.10
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the
          achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the
          hardware busy.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                             21
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                             16
    Theoretical Active Warps per SM                                             warp/cycle                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                           9.00
    Achieved Active Warps Per SM                                                      warp                           5.76
    ---------------------------------------------------------------------- --------------- ------------------------------

  clamp_generic, 2020-Aug-31 14:39:00, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         650.49
    SM Frequency                                                             cycle/usecond                         959.80
    Elapsed Cycles                                                                   cycle                          3,166
    Memory [%]                                                                           %                           5.83
    SOL DRAM                                                                             %                           5.83
    Duration                                                                       usecond                           3.30
    SOL L1/TEX Cache                                                                     %                           4.59
    SOL L2 Cache                                                                         %                           4.21
    SM Active Cycles                                                                 cycle                       1,404.49
    SM [%]                                                                               %                           1.83
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
          more details.

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                        128
    Grid Size                                                                                                         125
    Registers Per Thread                                                   register/thread                             16
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                         16,000
    Waves Per SM                                                                                                     0.10
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the
          achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the
          hardware busy.

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                             32
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                             16
    Theoretical Active Warps per SM                                             warp/cycle                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                           9.00
    Achieved Active Warps Per SM                                                      warp                           5.76
    ---------------------------------------------------------------------- --------------- ------------------------------

$
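
The Occupancy and Launch Statistics figures in the output above can be cross-checked with a little arithmetic. The following is only a rough sketch, not part of Nsight Compute: the helper names are made up for illustration, the limit of 64 resident warps per SM is an assumption (it is consistent with the kernels that report 64 theoretical active warps at 100% occupancy), and the "Waves Per SM" formula is simply one that appears to reproduce the reported values. The SM count of 80 is taken from the warning text in the report.

```python
# Cross-check a few figures from the ncu report above.
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64  # assumption: not stated directly in the report
NUM_SMS = 80           # from the WRN text: "the GPU's 80 multiprocessors"


def theoretical_occupancy(block_size, block_limits):
    """Return (blocks per SM, active warps per SM, occupancy in percent).

    block_limits holds the four "Block Limit ..." rows of the Occupancy
    section (SM, Registers, Shared Mem, Warps); the smallest one wins.
    """
    warps_per_block = -(-block_size // WARP_SIZE)  # ceiling division
    blocks_per_sm = min(block_limits)
    active_warps = min(blocks_per_sm * warps_per_block, MAX_WARPS_PER_SM)
    return blocks_per_sm, active_warps, 100.0 * active_warps / MAX_WARPS_PER_SM


def waves_per_sm(grid_size, blocks_per_sm):
    """Grid size divided by the number of blocks the whole GPU can hold at once."""
    return grid_size / (NUM_SMS * blocks_per_sm)


# Block size, grid size and block limits copied from the report above.
kernels = {
    "generate_seed_pseudo": dict(block=64, grid=64, limits=(32, 20, 32, 32)),
    "gen_sequenced": dict(block=64, grid=64, limits=(32, 24, 32, 32)),
    "cupy_multiply": dict(block=128, grid=125, limits=(32, 21, 32, 16)),
    "clamp_generic": dict(block=128, grid=125, limits=(32, 32, 32, 16)),
}

for name, k in kernels.items():
    blocks, warps, occ = theoretical_occupancy(k["block"], k["limits"])
    waves = waves_per_sm(k["grid"], blocks)
    print(f"{name}: {blocks} blocks/SM, {warps} warps/SM, "
          f"{occ:.2f}% theoretical occupancy, {waves:.2f} waves per SM")
```

For generate_seed_pseudo, the register limit of 20 blocks per SM is the binding one, which gives 40 warps and 62.5% theoretical occupancy. The achieved occupancy is far lower (about 3%) because 64 blocks cannot fill 80 SMs, which is what the warnings in the report point out.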

I probably won’t be able to respond to further or more detailed questions here. Please ask Nsight Compute usage questions on the forum I already linked.

akasshdeep99 · January 8, 2026, 7:31am

Hi Everyone,

I am trying to measure the performance difference between an 8B and a 70B model on a 4xH100 system. I have written a RAG benchmark script to study the performance: it sends 100 questions in batches, and vLLM processes them and generates the output. I collected NCU data with a single question on both the 8B and 70B models, and the two profiles look similar. What I want to understand is where the bottleneck is, since the 70B model's performance is somewhat lower than the 8B model's. I know the reason in theory, but what should I look for in the GPU metrics?
