I studied some introductory material on Tensor Cores using cuBLAS, cuDNN, or just bare code using wmma. While it is very useful and practical, I would like to check whether the code I intended to run on Tensor Cores really executed on them, or fell back to regular CUDA cores because one of the requirements was not met. Can Nsight profiling provide insight into that? Thanks.
In a full NCU report, the following sections can be used to determine whether Tensor Cores were used:
- Details | GPU Speed of Light Throughput | GPU Throughput Breakdown | SM: Pipe Tensor Cycles Active [%] > 0
- Details | Compute Workload Analysis | Tensor (All)/Tensor (FP)/Tensor (DP)/Tensor (INT) > 0
- Details | Instruction Statistics | Executed Instruction Mix | *MMA > 0
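As a quick sketch of how you could check this programmatically, you can also dump a tensor-pipe counter with `ncu --csv --metrics ...` and test whether it is nonzero. The metric name and the sample output below are assumptions for illustration; verify the exact metric names for your architecture with `ncu --query-metrics`.

```python
# Sketch: decide whether a profiled kernel used Tensor Cores from ncu CSV output.
# The metric name is an assumption; check `ncu --query-metrics` on your GPU.
# Example invocation (not run here):
#   ncu --csv --metrics sm__inst_executed_pipe_tensor.sum ./cublas_gemm_example
import csv
import io

TENSOR_METRIC = "sm__inst_executed_pipe_tensor.sum"  # assumed metric name

def tensor_cores_used(csv_text: str) -> bool:
    """Return True if any profiled kernel reports a nonzero tensor-pipe count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row.get("Metric Name") == TENSOR_METRIC:
            # ncu may print thousands separators, e.g. "2,048"
            value = float(row["Metric Value"].replace(",", ""))
            if value > 0:
                return True
    return False

# Hypothetical sample of ncu's CSV output, trimmed to the relevant columns:
sample = """Kernel Name,Metric Name,Metric Unit,Metric Value
volta_sgemm_32x32_sliced1x4_nn,sm__inst_executed_pipe_tensor.sum,inst,0
"""
print(tensor_cores_used(sample))  # FP32 sgemm -> no tensor instructions -> False
```

A zero count here means the kernel ran entirely on the regular CUDA core pipelines.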
Thanks! How do you get the full report?
I used “ncu <executable>” and got the output below.
I also used “ncu -o profile <executable>”, which generates a binary .ncu-rep report file.
==PROF== Disconnected from process 2711
[2711] cublas_gemm_example@127.0.0.1
volta_sgemm_32x32_sliced1x4_nn (8, 8, 2)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 7.5
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 6.74
SM Frequency cycle/nsecond 1.55
Elapsed Cycles cycle 26031
Memory Throughput % 31.59
DRAM Throughput % 3.80
Duration usecond 16.70
L1/TEX Cache Throughput % 61.86
L2 Cache Throughput % 16.39
SM Active Cycles cycle 19957.12
Compute (SM) Throughput % 31.59
----------------------- ------------- ------------

OPT   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name                      Metric Unit     Metric Value
-------------------------------- --------------- ---------------
Block Size                                       128
Function Cache Configuration                     CachePreferNone
Grid Size                                        128
Registers Per Thread             register/thread 86
Shared Memory Configuration Size Kbyte           65.54
Driver Shared Memory Per Block   byte/block      0
Dynamic Shared Memory Per Block  byte/block      0
Static Shared Memory Per Block   Kbyte/block     32.77
Threads                          thread          16384
Waves Per SM                                     1.60
-------------------------------- --------------- ---------------

Section: Occupancy
------------------------------- ----------- ------------
Metric Name                     Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM                  block       16
Block Limit Registers           block       5
Block Limit Shared Mem          block       2
Block Limit Warps               block       8
Theoretical Active Warps per SM warp        8
Theoretical Occupancy           %           25
Achieved Occupancy              %           22.76
Achieved Active Warps Per SM    warp        7.28
------------------------------- ----------- ------------

OPT   This kernel's theoretical occupancy (25.0%) is limited by the required amount of shared memory. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
void splitKreduce_kernel<(int)32, (int)16, int, float, float, float, float, (bool)1, (bool)0, (bool)0>(cublasSplitKParams, const T4 *, const T5 *, T5 *, const T6 *, const T6 *, const T7 *, const T4 *, T7 *, void *, long, T6 *, int *) (8, 16, 1)x(32, 16, 1), Context 1, Stream 13, Device 0, CC 7.5
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 6.12
SM Frequency cycle/nsecond 1.44
Elapsed Cycles cycle 6117
Memory Throughput % 31.69
DRAM Throughput % 31.69
Duration usecond 4.22
L1/TEX Cache Throughput % 21.57
L2 Cache Throughput % 13.84
SM Active Cycles cycle 4272.82
Compute (SM) Throughput % 15.91
----------------------- ------------- ------------

OPT   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name                      Metric Unit     Metric Value
-------------------------------- --------------- ---------------
Block Size                                       512
Function Cache Configuration                     CachePreferNone
Grid Size                                        128
Registers Per Thread             register/thread 44
Shared Memory Configuration Size Kbyte           32.77
Driver Shared Memory Per Block   byte/block      0
Dynamic Shared Memory Per Block  byte/block      0
Static Shared Memory Per Block   byte/block      0
Threads                          thread          65536
Waves Per SM                                     1.60
-------------------------------- --------------- ---------------

OPT   Estimated Speedup: 50%
      A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 48 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 21.1%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more details on launch configurations.

Section: Occupancy
------------------------------- ----------- ------------
Metric Name                     Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM                  block       16
Block Limit Registers           block       2
Block Limit Shared Mem          block       16
Block Limit Warps               block       2
Theoretical Active Warps per SM warp        32
Theoretical Occupancy           %           100
Achieved Occupancy              %           78.94
Achieved Active Warps Per SM    warp        25.26
------------------------------- ----------- ------------

OPT   Estimated Speedup: 21.06%
      This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated theoretical (100.0%) and measured achieved occupancy (78.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on optimizing occupancy.
You can add the option “--set full” on the command line. Then you can open the report in the NCU GUI directly; switch to the Details page and you can see the related info.
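For reference, a typical workflow might look like the commands below. The executable name is a placeholder taken from the output above; these commands require a local Nsight Compute installation and a supported GPU.

```shell
# Collect all sections (including Compute Workload Analysis and
# Instruction Statistics) into a binary .ncu-rep report file:
ncu --set full -o profile ./cublas_gemm_example

# Open the report in the Nsight Compute GUI and check the Details page:
ncu-ui profile.ncu-rep

# Or print the full report directly on the command line instead:
ncu --set full ./cublas_gemm_example
```

The Tensor Core indicators listed earlier in the thread appear in the GPU Speed of Light Throughput, Compute Workload Analysis, and Instruction Statistics sections of that report.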