Number of active SMs

I am using a Tesla K80 device. I obtained the number of active blocks per SM (calculated based on register and shared memory usage of each thread block) using cudaOccupancyMaxActiveBlocksPerMultiprocessor. However, how do I get the number of active SMs? The metric ‘sm_efficiency’ reports 99.9% averaged over all SMs. Does that mean all 13 SMs of the device are active for 99.9% of the cycles?
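For context, the device's SM count and the per-SM active-block count can be combined at runtime. A minimal sketch of the query (the kernel body, block size of 256, and zero dynamic shared memory are assumptions, not values from this thread):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for d_mult; only the signature matters
// for the occupancy calculation.
__global__ void d_mult(int m, int n, int k, float *a, float *b, float *c, int p) {}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // K80: 13 SMs per GK210 GPU

    int blockSize = 256;                 // assumed threads per block
    int activeBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &activeBlocksPerSM, d_mult, blockSize, 0 /* dynamic shared mem */);

    printf("SMs: %d, active blocks/SM: %d, device-wide active blocks: %d\n",
           prop.multiProcessorCount, activeBlocksPerSM,
           prop.multiProcessorCount * activeBlocksPerSM);
    return 0;
}
```

Note that this reports a theoretical upper bound on resident blocks, not how many SMs were actually active during a given launch.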

==13157== Profiling application: ./mt 4096 4096 4 PINNED_COPY
==13157== Profiling result:
==13157== Event result:
Invocations Event Name Min Max Avg
Device “Tesla K80 (0)”
Kernel: d_mult(int, int, int, float*, float*, float*, int)
1 elapsed_cycles_sm 86269220 86269220 86269220
1 active_cycles 86184600 86184600 86184600
1 active_warps 5175246460 5175246460 5175246460
==13157== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device “Tesla K80 (0)”
Kernel: d_mult(int, int, int, float*, float*, float*, int)
1 sm_efficiency Multiprocessor Activity 99.90% 99.90% 99.90%
1 sm_efficiency_instance Multiprocessor Activity 99.90% 99.90% 99.90%

If you have a single kernel launch with enough blocks, all SMs should be active.

Could you please explain what you mean by “enough blocks”? Does this mean that if the number of blocks in the kernel is at least (total number of SMs in the device * number of active blocks per SM), all SMs will be used? Thank you very much for your response.

# of blocks >= # of SMs

To cover edge cases it might be better to have # of blocks >= 2x # of SMs.

That should guarantee that the work distributor places at least one block on each SM.

If you want really even loading (to minimize the percentage difference in utilization between SMs), however, you may want to use a large number of blocks.

Thank you for clarifying. I was prompted to ask this question because I don’t see a difference between the values reported for sm_efficiency and sm_efficiency_instance, even though the former is supposed to be an average over all SMs. I thought that in corner cases where I launch fewer blocks than there are SMs, I should see a difference, but I don’t.

A historical rule of thumb is that one should strive for a grid to comprise > 20 * (number of SMs * number of active blocks/SM) blocks total, to achieve optimal efficiency (e.g. with respect to memory controllers) and minimize the impact of tail effects. I suspect that Maxwell and later architectures will reach maximum efficiency for smaller grids due to more fine-grained control mechanisms, but I haven’t looked into this experimentally.

I don’t think sm_efficiency tracks unused SMs

sm_efficiency is essentially the number of active issue slots (the number of slots where one or more instructions actually got issued) divided by the total issue slots for the duration of SM activity. Since the latter number would be zero for an unused/inactive SM, I don’t think it is included in calculations.

Txbob, if possible, could you please help me with the question I posted here:
https://devtalk.nvidia.com/default/topic/963199/nvprof-elapsed_cycles_sm-vs-time-in-milliseconds-/#4969130

I am having some trouble understanding this. I would really appreciate any help I get.

Thank you njuffa and txbob for the clarification.