Originally published at: https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/
By now, hopefully you read the first two blogs in this series “Migrating to NVIDIA Nsight Tools from NVVP and Nvprof” and “Transitioning to Nsight Systems from NVIDIA Visual Profiler / nvprof,” and you’ve discovered NVIDIA added a few new tools, both Nsight Compute and Nsight Systems, to the repertoire of CUDA tools available for…
How can I use Nsight Compute with Python scripts?
For general questions about using the tools, the blog article suggests asking on the tools forum. For Nsight Compute, that is here: Nsight Compute - NVIDIA Developer Forums
Generally speaking, nothing special should be required to use Nsight Compute with Python scripts. Using the sample Python/CuPy code from here: python 3.x - cupy indexing is slow - Stack Overflow. If I do:
ncu python t1.py
I get output like this:
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2012-2020 NVIDIA Corporation
Version 2020.1.0 (Build 28294165)
$ ncu python t1.py
==PROF== Connected to process 57559 (/home/nvidia/anaconda3/bin/python3.7)
==PROF== Profiling "generate_seed_pseudo" - 1: 0%....50%....100% - 19 passes
==PROF== Profiling "gen_sequenced" - 2: 0%....50%....100% - 19 passes
==PROF== Profiling "cupy_multiply" - 3: 0%....50%....100% - 19 passes
==PROF== Profiling "cupy_add" - 4: 0%....50%....100% - 19 passes
==PROF== Profiling "clamp_generic" - 5: 0%....50%....100% - 19 passes
==PROF== Disconnected from process 57559
[57559] python3.7@127.0.0.1
generate_seed_pseudo(unsigned long long, unsigned long long, unsigned long long, curandOrdering, curandStateXORWOW*, unsigned int*), 2020-Aug-31 14:38:59, Context 1, Stream 7
Section: GPU Speed Of Light
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 867.55
SM Frequency cycle/nsecond 1.28
Elapsed Cycles cycle 1,427,785
Memory [%] % 34.91
SOL DRAM % 0.04
Duration msecond 1.11
SOL L1/TEX Cache % 53.14
SOL L2 Cache % 34.91
SM Active Cycles cycle 797,969.30
SM [%] % 5.06
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
more details.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 64
Grid Size 64
Registers Per Thread register/thread 48
Shared Memory Configuration Size byte 0
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 4,096
Waves Per SM 0.04
---------------------------------------------------------------------- --------------- ------------------------------
WRN The grid for this launch is configured to execute only 64 blocks, which is less than the GPU's 80
multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
concurrently with other workloads, consider reducing the block size to have at least one block per
multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 20
Block Limit Shared Mem block 32
Block Limit Warps block 32
Theoretical Active Warps per SM warp/cycle 40
Theoretical Occupancy % 62.50
Achieved Occupancy % 3.01
Achieved Active Warps Per SM warp 1.93
---------------------------------------------------------------------- --------------- ------------------------------
void gen_sequenced<curandStateXORWOW, double2, normal_args_double_st, &(double2 curand_normal_scaled2_double<curandStateXORWOW>(curandStateXORWOW*, normal_args_double_st))>(curandStateXORWOW*, double2*, unsigned long, unsigned long, normal_args_double_st), 2020-Aug-31 14:38:59, Context 1, Stream 7
Section: GPU Speed Of Light
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 597.09
SM Frequency cycle/usecond 880.71
Elapsed Cycles cycle 5,812
Memory [%] % 11.79
SOL DRAM % 4.88
Duration usecond 6.59
SOL L1/TEX Cache % 17.67
SOL L2 Cache % 11.79
SM Active Cycles cycle 3,159.89
SM [%] % 3.61
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
more details.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 64
Grid Size 64
Registers Per Thread register/thread 36
Shared Memory Configuration Size byte 0
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 4,096
Waves Per SM 0.03
---------------------------------------------------------------------- --------------- ------------------------------
WRN The grid for this launch is configured to execute only 64 blocks, which is less than the GPU's 80
multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
concurrently with other workloads, consider reducing the block size to have at least one block per
multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 24
Block Limit Shared Mem block 32
Block Limit Warps block 32
Theoretical Active Warps per SM warp/cycle 48
Theoretical Occupancy % 75
Achieved Occupancy % 3.07
Achieved Active Warps Per SM warp 1.96
---------------------------------------------------------------------- --------------- ------------------------------
cupy_multiply, 2020-Aug-31 14:39:00, Context 1, Stream 7
Section: GPU Speed Of Light
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 669.64
SM Frequency cycle/usecond 990.33
Elapsed Cycles cycle 3,552
Memory [%] % 5.21
SOL DRAM % 5.21
Duration usecond 3.58
SOL L1/TEX Cache % 4.15
SOL L2 Cache % 3.76
SM Active Cycles cycle 1,555.72
SM [%] % 1.83
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
more details.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 128
Grid Size 125
Registers Per Thread register/thread 18
Shared Memory Configuration Size byte 0
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 16,000
Waves Per SM 0.10
---------------------------------------------------------------------- --------------- ------------------------------
WRN If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the
achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the
hardware busy.
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 21
Block Limit Shared Mem block 32
Block Limit Warps block 16
Theoretical Active Warps per SM warp/cycle 64
Theoretical Occupancy % 100
Achieved Occupancy % 8.97
Achieved Active Warps Per SM warp 5.74
---------------------------------------------------------------------- --------------- ------------------------------
cupy_add, 2020-Aug-31 14:39:00, Context 1, Stream 7
Section: GPU Speed Of Light
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 650.49
SM Frequency cycle/usecond 961.01
Elapsed Cycles cycle 3,170
Memory [%] % 5.83
SOL DRAM % 5.83
Duration usecond 3.30
SOL L1/TEX Cache % 4.65
SOL L2 Cache % 4.21
SM Active Cycles cycle 1,387.59
SM [%] % 2.05
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
more details.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 128
Grid Size 125
Registers Per Thread register/thread 18
Shared Memory Configuration Size byte 0
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 16,000
Waves Per SM 0.10
---------------------------------------------------------------------- --------------- ------------------------------
WRN If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the
achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the
hardware busy.
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 21
Block Limit Shared Mem block 32
Block Limit Warps block 16
Theoretical Active Warps per SM warp/cycle 64
Theoretical Occupancy % 100
Achieved Occupancy % 9.00
Achieved Active Warps Per SM warp 5.76
---------------------------------------------------------------------- --------------- ------------------------------
clamp_generic, 2020-Aug-31 14:39:00, Context 1, Stream 7
Section: GPU Speed Of Light
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 650.49
SM Frequency cycle/usecond 959.80
Elapsed Cycles cycle 3,166
Memory [%] % 5.83
SOL DRAM % 5.83
Duration usecond 3.30
SOL L1/TEX Cache % 4.59
SOL L2 Cache % 4.21
SM Active Cycles cycle 1,404.49
SM [%] % 1.83
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
more details.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 128
Grid Size 125
Registers Per Thread register/thread 16
Shared Memory Configuration Size byte 0
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 16,000
Waves Per SM 0.10
---------------------------------------------------------------------- --------------- ------------------------------
WRN If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the
achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the
hardware busy.
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 32
Block Limit Shared Mem block 32
Block Limit Warps block 16
Theoretical Active Warps per SM warp/cycle 64
Theoretical Occupancy % 100
Achieved Occupancy % 9.00
Achieved Active Warps Per SM warp 5.76
---------------------------------------------------------------------- --------------- ------------------------------
$
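As a sanity check on how the Launch Statistics and Occupancy numbers above relate to each other, here is a small Python sketch that recomputes a few of them from the block size, grid size, and block limits in the output. The 80 multiprocessors come from the warning text. The 64-warps-per-SM maximum is my assumption, inferred from the kernels that report 64 theoretical active warps at 100% occupancy. The helper name launch_summary and the waves formula are my own reading of the numbers, not the tool's documented definitions.

```python
import math

# Values taken from the ncu output above, plus one labeled assumption.
SM_COUNT = 80           # "the GPU's 80 multiprocessors" in the warning text
MAX_WARPS_PER_SM = 64   # assumption: inferred from 64 warps = 100% occupancy
WARP_SIZE = 32          # threads per warp on NVIDIA GPUs


def launch_summary(block_size, grid_size, block_limits):
    """Recompute derived metrics for one kernel launch (hypothetical helper).

    block_limits holds the four "Block Limit" rows (SM, Registers,
    Shared Mem, Warps); the smallest one caps resident blocks per SM.
    """
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    blocks_per_sm = min(block_limits)
    active_warps = min(blocks_per_sm * warps_per_block, MAX_WARPS_PER_SM)
    return {
        "threads": block_size * grid_size,
        "theoretical_active_warps": active_warps,
        "theoretical_occupancy_pct": 100.0 * active_warps / MAX_WARPS_PER_SM,
        # One "wave" is taken here as every SM holding its maximum number
        # of resident blocks at once.
        "waves_per_sm": grid_size / (SM_COUNT * blocks_per_sm),
    }


if __name__ == "__main__":
    # generate_seed_pseudo: Block Size 64, Grid Size 64, limits 32/20/32/32
    print(launch_summary(64, 64, (32, 20, 32, 32)))
    # cupy_multiply: Block Size 128, Grid Size 125, limits 32/21/32/16
    print(launch_summary(128, 125, (32, 21, 32, 16)))
```

For generate_seed_pseudo this gives 4,096 threads, 40 theoretical active warps, 62.5% theoretical occupancy, and about 0.04 waves per SM, matching the report. For cupy_multiply it gives 16,000 threads, 64 warps, 100%, and about 0.10 waves. The gap between theoretical and achieved occupancy (3% to 9% here) is consistent with the "grid is too small" warnings: with far less than one wave of work, most of the GPU sits idle.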
I probably won’t be able to respond to further or detailed questions here. Please ask Nsight Compute usage questions on the forum I already linked.