I would like to manually verify the SM occupancy reported for a test kernel. To do so, I wrote a simple vector addition kernel (a[i] = a[i] + 1) and launch it with the following grid and block sizes:
int M = atoi( argv[1] ); int N = atoi( argv[2] );
dim3 grid_size( M ); dim3 block_size( N );
increment<<< grid_size, block_size >>>( d_a, N );
Assuming an Ampere GPU with 68 SMs and 128 cores per SM, please see the following output:
$ nvcc -o vec_add -arch=sm_86 -Xptxas -O3,-v vec_add.cu

$ nv-nsight-cu-cli ./vec_add 1 128
Theoretical Occupancy             %      100
Achieved Occupancy                %      8,15
Achieved Active Warps Per SM      warp   3,91

$ nv-nsight-cu-cli ./vec_add 1 1024
Theoretical Occupancy             %      66,67
Achieved Occupancy                %      62,50
Achieved Active Warps Per SM      warp   30,00

$ nv-nsight-cu-cli ./vec_add 68 1024
Theoretical Occupancy             %      66,67
Achieved Occupancy                %      54,54
Achieved Active Warps Per SM      warp   26,18
First, why is the theoretical occupancy not 100% for a block size of 1024?
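As a sanity check on the numbers above, here is a minimal sketch of the arithmetic I would expect behind them. It assumes per-SM limits of 48 resident warps (1536 threads) and 16 resident blocks for compute capability 8.6; these limits are my assumption and do not come from the profiler output. It also ignores register and shared-memory limits, which should not matter for such a small kernel. All function and constant names are hypothetical.

```cpp
// Assumed per-SM limits for compute capability 8.6 (not from the profiler output).
constexpr int kWarpSize       = 32;
constexpr int kMaxWarpsPerSM  = 48;  // 1536 resident threads / 32
constexpr int kMaxBlocksPerSM = 16;

// Warps needed by one block, rounded up to whole warps.
// Expects 1 <= block_size <= 1024.
constexpr int warps_per_block(int block_size) {
    return (block_size + kWarpSize - 1) / kWarpSize;
}

// Blocks that fit on one SM when only the warp and block limits apply
// (registers and shared memory are ignored in this sketch).
constexpr int resident_blocks(int block_size) {
    return (kMaxWarpsPerSM / warps_per_block(block_size)) < kMaxBlocksPerSM
               ? (kMaxWarpsPerSM / warps_per_block(block_size))
               : kMaxBlocksPerSM;
}

// Theoretical occupancy in percent: resident warps / maximum warps per SM.
constexpr double theoretical_occupancy(int block_size) {
    return 100.0 * resident_blocks(block_size) * warps_per_block(block_size)
           / kMaxWarpsPerSM;
}

// Achieved occupancy in percent, from the measured active warps per SM.
constexpr double achieved_occupancy(double active_warps_per_sm) {
    return 100.0 * active_warps_per_sm / kMaxWarpsPerSM;
}
```

Under these assumptions, theoretical_occupancy(128) is 100 and theoretical_occupancy(1024) is about 66.67, because only one 32-warp block fits under a 48-warp limit. Likewise, achieved_occupancy(30.0) is 62.5, which matches the reported pair of 30,00 warps and 62,50 %.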
Second, I assume that (1,128) means one block of 128 threads, which runs on a single SM. Since an SM has 128 cores, I assume that each core receives one thread. However, I am aware that in reality this may not hold, as the scheduler may dispatch all 4 warps to a single sub-partition (32 cores). Is there a way to check that?