I recommend reviewing the Nsight VSE CUDA Profiler documentation on
These metrics are all related to the Streaming Multiprocessor (SM).
achieved_occupancy is the ratio of active warps (warps resident on the SM and actively being scheduled) to the maximum number of warps the SM can support. The higher the occupancy, the more likely the warp scheduler can hide latency. Higher occupancy also usually means fewer resources (e.g. registers/thread) available per warp.
inst_per_warp is the average number of instructions executed per warp.
sm_efficiency is the ratio of cycles in which an SM had at least 1 active warp to the total number of cycles in the measurement. sm_activity would be a more accurate name. If sm_efficiency is less than 90% then either there was insufficient work launched (increase thread blocks per launch) or the kernel has a bad tail effect (a subset of blocks/warps runs longer than the rest). Fix this first.
stall_not_selected is the percentage of active warps that were ready to issue an instruction but were passed over because the warp scheduler picked a higher priority warp. If this number is high then part or all of the kernel has sufficient occupancy (active_warps) to hide instruction latency. If this number is really high then it may be worth decreasing occupancy by trying to use more registers/thread. Each cycle, each warp scheduler can pick one eligible warp (an active warp that is not stalled) to issue instructions. If there are multiple eligible warps then 1 warp will report the reason selected and the other eligible warps will report not selected.
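To make the occupancy/register trade-off concrete, here is a rough sketch of the arithmetic. The limits used (65536 registers and 64 warps per SM) are assumptions for illustration only; real values depend on the compute capability, and real hardware also rounds register allocation to a granularity that this ignores. The function names are hypothetical, not part of any CUDA API.

```python
WARP_SIZE = 32  # threads per warp

def occupancy(active_warps, max_warps_per_sm):
    """Ratio of active warps to the maximum number of warps the SM supports."""
    return active_warps / max_warps_per_sm

def register_limited_warps(regs_per_thread, regs_per_sm=65536, max_warps_per_sm=64):
    """Upper bound on resident warps when registers are the only limiter.

    regs_per_sm and max_warps_per_sm are assumed values, not queried from a device.
    """
    warps_by_regs = regs_per_sm // (regs_per_thread * WARP_SIZE)
    return min(warps_by_regs, max_warps_per_sm)

if __name__ == "__main__":
    for regs in (32, 64, 128):
        warps = register_limited_warps(regs)
        print(regs, "regs/thread ->", warps, "warps, occupancy", occupancy(warps, 64))
```

Under these assumptions, doubling registers/thread halves the number of warps that can be resident, which is the trade-off described above.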
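A simplified model of why too few blocks or a partial last wave lowers sm_efficiency. It assumes one resident block per SM at a time and that every block takes the same amount of time; both are simplifications, and the function name is hypothetical.

```python
import math

def estimated_sm_efficiency(num_blocks, num_sms):
    """Fraction of SM time slots that have a block to run.

    Assumes one block per SM at a time and equal run time for every block.
    """
    if num_blocks <= 0 or num_sms <= 0:
        raise ValueError("num_blocks and num_sms must be positive")
    waves = math.ceil(num_blocks / num_sms)  # rounds of blocks needed
    return num_blocks / (waves * num_sms)

if __name__ == "__main__":
    print(estimated_sm_efficiency(40, 80))   # insufficient work: half the SMs idle
    print(estimated_sm_efficiency(100, 80))  # tail effect: second wave is mostly empty
    print(estimated_sm_efficiency(160, 80))  # two full waves
```

In this model 100 blocks on 80 SMs leaves 60 SMs idle during the second wave, so launching a multiple of the SM count (or many more blocks than SMs) brings the estimate back toward 100%.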
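A toy model of the selected / not selected accounting for a single warp scheduler. The input is a made-up list of eligible warp counts per cycle, and the returned fraction is only an illustration of the bookkeeping, not the exact formula the profiler uses.

```python
def classify_cycle(eligible_warps):
    """Return (selected, not_selected) for one scheduler in one cycle."""
    selected = 1 if eligible_warps > 0 else 0  # at most one warp is picked
    not_selected = eligible_warps - selected   # the rest were ready but passed over
    return selected, not_selected

def not_selected_fraction(eligible_per_cycle):
    """Share of eligible warp-cycles in which the warp was not picked."""
    total_eligible = sum(eligible_per_cycle)
    if total_eligible == 0:
        return 0.0
    passed_over = sum(classify_cycle(e)[1] for e in eligible_per_cycle)
    return passed_over / total_eligible

if __name__ == "__main__":
    print(not_selected_fraction([0, 1, 3, 4]))
```

A high value here means the scheduler usually had more than one warp to choose from, i.e. there was spare occupancy.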
warp_execution_efficiency is the ratio of average active threads per warp per instruction executed to the maximum number of threads per instruction (warp_size = 32). If this is less than 100% then the kernel has either thread divergence or the kernel was not launched with a multiple of 32 threads per block.
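Both causes can be shown with a little arithmetic. This is a sketch with hypothetical function names; the per-instruction active thread counts in the second function are made-up inputs, not profiler output.

```python
import math

WARP_SIZE = 32  # threads per warp

def launch_warp_efficiency(threads_per_block):
    """Efficiency lost purely because the block size is not a multiple of 32."""
    warps = math.ceil(threads_per_block / WARP_SIZE)
    return threads_per_block / (warps * WARP_SIZE)

def warp_execution_efficiency(active_threads_per_instruction):
    """Average active threads per executed instruction, divided by the warp size."""
    executed = len(active_threads_per_instruction)
    return sum(active_threads_per_instruction) / (executed * WARP_SIZE)

if __name__ == "__main__":
    print(launch_warp_efficiency(48))    # second warp is only half full
    print(launch_warp_efficiency(256))   # multiple of 32
    # One full-width instruction, a branch where each side runs with
    # 16 threads, then a full-width instruction after reconvergence.
    print(warp_execution_efficiency([32, 16, 16, 32]))
```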
eligible_warps_per_cycle is the number of active warps per cycle that are not stalled. I believe CUPTI measures this at the SM level. To issue at the maximum rate, each warp scheduler on the SM has to have at least 1 eligible warp, so for most architectures (4 warp schedulers per SM) this number has to be at least 4.
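A small sketch of why the eligible warps have to be spread across the schedulers. It assumes each warp belongs to one scheduler and that a scheduler issues from at most one warp per cycle; the per-scheduler counts are made-up inputs and the function name is hypothetical.

```python
def issue_slot_utilization(eligible_per_scheduler):
    """Fraction of warp schedulers that have a warp to issue this cycle."""
    busy = sum(1 for eligible in eligible_per_scheduler if eligible >= 1)
    return busy / len(eligible_per_scheduler)

if __name__ == "__main__":
    print(issue_slot_utilization([1, 1, 1, 1]))  # 4 eligible warps, one per scheduler
    print(issue_slot_utilization([2, 0, 1, 1]))  # 4 eligible warps, one scheduler idle
    print(issue_slot_utilization([4, 0, 0, 0]))  # 4 eligible warps, three schedulers idle
```

So an SM-level value of 4 is the minimum needed, and it only gives the full issue rate if every scheduler gets one of those warps.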