Theoretic occupancy is calculated assuming that there are enough blocks available to “fill” all the SMs. It provides an upper bound on how many blocks of a kernel can execute simultaneously on each SM. You can find more information about theoretic occupancy in the CUDA Best Practices Guide at docs.nvidia.com.
Theoretic occupany is limited by threads per block, registers per thread, and shared-memory usage per block. Assuming that register and shared memory usage are low, your kernel would have a theoretical occupancy of 100% because 8 blocks could execute on each SM ((8 * 256) / 2056) = 1.0.
Achieved occupany is an actual measure of the occupancy achieved across all SMs on the GPU. It is always bounded by theoretic occupancy but can be lower when all SMs are not equally busy over the entire duration of the kernel.
If you take the kernel from the cupti_metric example and run it through nvprof or nvvp you can see both the theoretic and achieved occupancies (you can’t run the example directly because it calls the cupti API directly and nvprof and nvvp don’t support apps that call cupti directly). You can also try using the occupancy calculator that is included in the toolkit. It is a spreadsheet that lets you experiment with different kernel parameters.