Each SM has four sub-partitions, correct. The difficulty in understanding the numbers comes from the difference in what is counted (cycles vs. instructions). As a suggestion, in such cases it can be helpful to collect not only a single sub-metric (.sum), but at least all first-level sub-metrics (.sum/.avg/.min/.max). There is no profiling overhead penalty for that. When using the CLI, you can omit the suffix to collect all first-level sub-metrics, e.g. --metrics sm__cycles_active,sm__inst_executed.
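For instance (./my_app is just a placeholder for your own binary), the base-name form collects all four first-level sub-metrics in one go:

```
# collects .sum/.avg/.min/.max of both metrics in a single pass
ncu --metrics sm__cycles_active,sm__inst_executed ./my_app
```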
As a first example, consider that you have only one thread, or one warp, so that only one SMSP of one SM is active. In this case, you will find that the min, max and sum values for cycles (almost) match up between SM and SMSP, and the numbers for instructions are identical. (The averages will differ, since the number of SMs and SMSPs is not the same.)
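If you want to see this yourself, a minimal sketch (kernel and names made up for illustration) that launches exactly one warp looks like this:

```cuda
#include <cuda_runtime.h>

// A single warp doing a bit of work: only one SMSP of one SM
// becomes active during the launch.
__global__ void single_warp(float *out)
{
    float v = (float)threadIdx.x;
    for (int i = 0; i < 1000; ++i)  // keep the warp busy for a while
        v = v * 1.0001f + 0.5f;
    out[threadIdx.x] = v;
}

int main()
{
    float *out;
    cudaMalloc(&out, 32 * sizeof(float));
    single_warp<<<1, 32>>>(out);    // one block of 32 threads = one warp
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

Profiling this with the ncu command above, the .max and .sum values of sm__cycles_active and smsp__cycles_active should (almost) match, and the inst_executed sums should be identical.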
When going from one thread to, say, 256 threads in one block, all four SMSPs of the first SM become active. Let’s assume SMSP 0 is active for 1500 cycles and SMSPs 1-3 for 500 cycles each. smsp__cycles_active.sum will then be 3000. However, sm__cycles_active.sum will only be 1500 (or slightly more), since it’s the sum across all SMs, not across all SMSPs, and the 500 cycles of SMSPs 1-3 overlap with, i.e. are “hidden” by, the 1500 cycles of the longest-running SMSP 0. That’s because the sub-partitions don’t have separate cycle counters; they all run on the same cycle signal of their SM, and the SM-level counter counts each cycle only once, no matter how many SMSPs are active in it.
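With the numbers from this (made-up) example:

```
SMSP active cycles on the first SM: (1500, 500, 500, 500)

smsp__cycles_active.sum = 1500 + 500 + 500 + 500 = 3000
sm__cycles_active.sum   ~ 1500   (one counter per SM: the cycles in which
                                  SMSPs 1-3 were active are the same cycles
                                  SMSP 0 was active in)
```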
Instructions are counted by a different signal in the hardware, and they don’t “overlap”, since each individual instruction is executed by itself (in contrast to cycles, where multiple SMSPs can be active on the same cycle). As such, the SMSPs might execute (15, 5, 5, 5) instructions, respectively, leading to smsp__inst_executed.sum being 30 (with .max being 15 and .min being 0, since the SMSPs of all SMs, including the idle ones, are considered). Nevertheless, sm__inst_executed.sum (and .max) will also be 30, since the first SM truly executed 30 individual instructions.
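Again with concrete numbers:

```
Instructions executed per SMSP on the first SM: (15, 5, 5, 5)

smsp__inst_executed.sum = 15 + 5 + 5 + 5 = 30
smsp__inst_executed.max = 15
smsp__inst_executed.min = 0    (SMSPs on the idle SMs are included)

sm__inst_executed.sum   = 30   (the first SM really executed all 30
sm__inst_executed.max   = 30    individual instructions)
```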