I studied some introductory material on Tensor Cores using cuBLAS, cuDNN, or just bare code using wmma. While it is very useful and practical, I would like to check whether the code I intended to run on Tensor Cores really executed on them, or fell back to regular CUDA cores because one of the requirements was not met. Can Nsight profiling provide insight into that? Thanks.
In a full NCU report, the following sections can be used to determine whether Tensor Cores were used:
- Details | GPU Speed of Light Throughput | GPU Throughput Breakdown | SM: Pipe Tensor Cycles Active [%] > 0
- Details | Compute Workload Analysis | Tensor (All)/Tensor (FP)/Tensor (DP)/Tensor (INT) > 0
- Details | Instruction Statistics | Executed Instruction Mix | *MMA > 0
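As a quick sketch of how you could check this programmatically, you can also dump a tensor-pipe counter with `ncu --csv --metrics ...` and test whether it is nonzero. The metric name and the sample output below are assumptions for illustration; verify the exact metric names for your architecture with `ncu --query-metrics`.

```python
# Sketch: decide whether a profiled kernel used Tensor Cores from ncu CSV output.
# The metric name is an assumption; check `ncu --query-metrics` on your GPU.
# Example invocation (not run here):
#   ncu --csv --metrics sm__inst_executed_pipe_tensor.sum ./cublas_gemm_example
import csv
import io

TENSOR_METRIC = "sm__inst_executed_pipe_tensor.sum"  # assumed metric name

def tensor_cores_used(csv_text: str) -> bool:
    """Return True if any profiled kernel reports a nonzero tensor-pipe count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row.get("Metric Name") == TENSOR_METRIC:
            # ncu may print thousands separators, e.g. "2,048"
            value = float(row["Metric Value"].replace(",", ""))
            if value > 0:
                return True
    return False

# Hypothetical sample of ncu's CSV output, trimmed to the relevant columns:
sample = """Kernel Name,Metric Name,Metric Unit,Metric Value
volta_sgemm_32x32_sliced1x4_nn,sm__inst_executed_pipe_tensor.sum,inst,0
"""
print(tensor_cores_used(sample))  # FP32 sgemm -> no tensor instructions -> False
```

A zero count here means the kernel ran entirely on the regular CUDA core pipelines.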
Thanks! How do you get the full report?
I used “ncu <executable>” and got the output below.
I also used “ncu -o profile <executable>”, which generates a binary .ncu-rep report file.
==PROF== Disconnected from process 2711
[2711] cublas_gemm_example@127.0.0.1
volta_sgemm_32x32_sliced1x4_nn (8, 8, 2)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 7.5
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 6.74
SM Frequency cycle/nsecond 1.55
Elapsed Cycles cycle 26031
Memory Throughput % 31.59
DRAM Throughput % 3.80
Duration usecond 16.70
L1/TEX Cache Throughput % 61.86
L2 Cache Throughput % 16.39
SM Active Cycles cycle 19957.12
Compute (SM) Throughput % 31.59
----------------------- ------------- ------------

OPT   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name                      Metric Unit     Metric Value
-------------------------------- --------------- ---------------
Block Size                                       128
Function Cache Configuration                     CachePreferNone
Grid Size                                        128
Registers Per Thread             register/thread 86
Shared Memory Configuration Size Kbyte           65.54
Driver Shared Memory Per Block   byte/block      0
Dynamic Shared Memory Per Block  byte/block      0
Static Shared Memory Per Block   Kbyte/block     32.77
Threads                          thread          16384
Waves Per SM                                     1.60
-------------------------------- --------------- ---------------

Section: Occupancy
------------------------------- ----------- ------------
Metric Name                     Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM                  block       16
Block Limit Registers           block       5
Block Limit Shared Mem          block       2
Block Limit Warps               block       8
Theoretical Active Warps per SM warp        8
Theoretical Occupancy           %           25
Achieved Occupancy              %           22.76
Achieved Active Warps Per SM    warp        7.28
------------------------------- ----------- ------------

OPT   This kernel's theoretical occupancy (25.0%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void splitKreduce_kernel<(int)32, (int)16, int, float, float, float, float, (bool)1, (bool)0, (bool)0>(cublasSplitKParams, const T4 *, const T5 *, T5 *, const T6 *, const T6 *, const T7 *, const T4 *, T7 *, void *, long, T6 *, int *) (8, 16, 1)x(32, 16, 1), Context 1, Stream 13, Device 0, CC 7.5
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 6.12
SM Frequency cycle/nsecond 1.44
Elapsed Cycles cycle 6117
Memory Throughput % 31.69
DRAM Throughput % 31.69
Duration usecond 4.22
L1/TEX Cache Throughput % 21.57
L2 Cache Throughput % 13.84
SM Active Cycles cycle 4272.82
Compute (SM) Throughput % 15.91
----------------------- ------------- ------------

OPT   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name                      Metric Unit     Metric Value
-------------------------------- --------------- ---------------
Block Size                                       512
Function Cache Configuration                     CachePreferNone
Grid Size                                        128
Registers Per Thread             register/thread 44
Shared Memory Configuration Size Kbyte           32.77
Driver Shared Memory Per Block   byte/block      0
Dynamic Shared Memory Per Block  byte/block      0
Static Shared Memory Per Block   byte/block      0
Threads                          thread          65536
Waves Per SM                                     1.60
-------------------------------- --------------- ---------------

OPT   Estimated Speedup: 50%
      A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 48 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 21.1%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
------------------------------- ----------- ------------
Metric Name                     Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM                  block       16
Block Limit Registers           block       2
Block Limit Shared Mem          block       16
Block Limit Warps               block       2
Theoretical Active Warps per SM warp        32
Theoretical Occupancy           %           100
Achieved Occupancy              %           78.94
Achieved Active Warps Per SM    warp        25.26
------------------------------- ----------- ------------

OPT   Estimated Speedup: 21.06%
      This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (78.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
You can add the option “--set full” on the command line. Then you can open the report in the NCU GUI directly; switch to the Details page and you can see the related info.
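For reference, a typical workflow might look like the commands below. The executable name is a placeholder taken from the output above; these commands require a local Nsight Compute installation and a supported GPU.

```shell
# Collect all sections (including Compute Workload Analysis and
# Instruction Statistics) into a binary .ncu-rep report file:
ncu --set full -o profile ./cublas_gemm_example

# Open the report in the Nsight Compute GUI and check the Details page:
ncu-ui profile.ncu-rep

# Or print the full report directly on the command line instead:
ncu --set full ./cublas_gemm_example
```

The Tensor Core indicators listed earlier in the thread appear in the GPU Speed of Light Throughput, Compute Workload Analysis, and Instruction Statistics sections of that report.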