I am using a Tesla K80 device. I obtained the number of active blocks per SM (calculated based on the register and shared memory usage of each thread block) using cudaOccupancyMaxActiveBlocksPerMultiprocessor. However, how do I get the number of active SMs? The metric ‘sm_efficiency’ reports 99.9% averaged over all SMs. Does it mean all 13 SMs of the device are active 99.9% of the cycles?
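For reference, here is a minimal sketch of how the SM count can be queried alongside the occupancy API. The kernel `myKernel`, the block size of 256 threads, and the dynamic shared memory size of 0 are placeholder assumptions, not values from the original post:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }  // placeholder kernel for illustration

int main() {
    int device = 0;
    cudaSetDevice(device);

    // Number of SMs on the device (13 per GPU on a Tesla K80)
    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    // Max concurrently resident blocks per SM for this kernel,
    // assuming 256 threads/block and no dynamic shared memory
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  256 /* blockSize */,
                                                  0   /* dynSmemBytes */);

    printf("SMs: %d, active blocks/SM: %d, blocks to fill device once: %d\n",
           numSMs, blocksPerSM, numSMs * blocksPerSM);
    return 0;
}
```

The product `numSMs * blocksPerSM` is the number of blocks that can be resident on the device at one time, which is the quantity the follow-up question below is asking about.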
Could you please explain what you mean by enough blocks? Does this mean that if the number of blocks in the kernel is at least (total number of SMs in the device * number of active blocks per SM), all SMs will be used? Thank you very much for your response.
Thank you for clarifying. I was prompted to ask this question because I don’t see a difference between the values reported by sm_efficiency and sm_efficiency_instance, even though the former is supposed to be an average over all SMs. I thought that in corner cases where I launch fewer blocks than there are SMs, I should see a difference, but I don’t.
A historical rule of thumb is that one should strive for a grid comprising > 20 * (number of SMs * number of active blocks/SM) blocks total, to achieve optimal efficiency (e.g. with respect to memory controllers) and minimize the impact of tail effects. I suspect that Maxwell and later architectures reach maximum efficiency with smaller grids due to more fine-grained control mechanisms, but I haven’t looked into this experimentally.
sm_efficiency is essentially the number of active issue slots (the slots where one or more instructions actually got issued) divided by the total number of issue slots over the duration of SM activity. Since the latter number would be zero for an unused/inactive SM, I don’t think inactive SMs are included in the calculation.