Theoretical occupancy less than 100%

Hi
I would like to manually check the SM occupancy of a test code. So I created a simple vector increment kernel (a[i] = a[i] + 1) with the following grid/block sizes:

int M = atoi( argv[1] ); int N = atoi( argv[2] );
dim3 grid_size( M ); dim3 block_size( N );
increment<<< grid_size, block_size >>>( d_a, N );

Assuming an Ampere GPU with 68 SMs and 128 cores per SM, please see the following outputs:

$ nvcc -o vec_add -arch=sm_86 -Xptxas -O3,-v vec_add.cu 
$ nv-nsight-cu-cli ./vec_add 1 128
    Theoretical Occupancy                   %                100
    Achieved Occupancy                      %                8,15
    Achieved Active Warps Per SM           warp              3,91
$ nv-nsight-cu-cli ./vec_add 1 1024
    Theoretical Occupancy                   %                66,67
    Achieved Occupancy                      %                62,50
    Achieved Active Warps Per SM           warp              30,00
$ nv-nsight-cu-cli ./vec_add 68 1024
    Theoretical Occupancy                   %                66,67
    Achieved Occupancy                      %                54,54
    Achieved Active Warps Per SM           warp              26,18

First, why is the theoretical occupancy not 100% for a block size of 1024?
Second, I assume that (1,128) means one SM and 128 threads. Since an SM has 128 cores, I assume each core receives one thread. However, I am aware that in reality this may not be correct, as the scheduler may dispatch all 4 warps to a single sub-partition (32 cores). Is there a way to check that?

Your Ampere GPU is cc8.6. Referring to table 15 of the programming guide, we see that the maximum thread complement (“Maximum number of resident threads per SM”) on a cc8.6 SM is 1536 threads. If you launch a kernel with threadblocks of 1024 threads, at most one of these can fit on an SM. Therefore you are using at most 1024/1536 of the “capacity” of the SM, i.e. 66.67%. If you choose a threadblock size that is a factor of 1536, you should witness 100% theoretical occupancy for a simple vector-add test case. Note that 1536 = 2^9 × 3.
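
If you want to cross-check those numbers without the profiler, here is a minimal sketch (using the same trivial increment kernel) that asks the runtime's occupancy calculator how many blocks of a given size can be resident per SM; the percentages it prints should line up with the theoretical occupancy figures above:

#include <cstdio>

__global__ void increment( int *a, int N )
{
  int i = threadIdx.x;
  if ( i < N )
    a[ i ] = a[ i ] + 1;
}

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties( &prop, 0 );   // maxThreadsPerMultiProcessor is 1536 on cc8.6

  int sizes[] = { 128, 256, 512, 1024 };
  for ( int block : sizes ) {
    int blocksPerSM = 0;
    // How many blocks of this size can be simultaneously resident on one SM?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor( &blocksPerSM, increment, block, 0 );
    float occ = 100.0f * blocksPerSM * block / prop.maxThreadsPerMultiProcessor;
    printf( "block size %4d : %2d resident blocks/SM, theoretical occupancy %.2f%%\n",
            block, blocksPerSM, occ );
  }
  return 0;
}

On a cc8.6 device this should report 12 resident blocks (100%) for a block size of 128 and only 1 (66.67%) for 1024.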

This is a puzzling statement, first because I assume you wrote the code, and second because you have not shown the code. But my guess would be that (1,128) much more likely means 1 block of 128 threads:

dim3 grid_size( M ); dim3 block_size( N );

A block and an SM are not the same thing. Yes, it's true that with a grid launch of 1 block you can be certain that exactly 1 SM is being used, but this doesn't hold for other values of M.

That isn't how the CUDA execution model works. You should reject the notion that cores handle threads; they do not. For a detailed treatment of this topic, there are probably dozens or hundreds of questions on various forums that delve into it, or alternatively this training (the first 4 sessions) will give you a grounded introduction to CUDA that explains the execution model to some degree.

No, the scheduler won’t do that in this case. The scheduler distributes warps evenly to sub-partitions.
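
The warp-to-sub-partition assignment isn't directly exposed, but if you just want some visibility into where your blocks end up, one rough diagnostic (a sketch, not something the profiler output above relies on) is to read the %smid PTX special register inside the kernel. It at least confirms the "1 block, 1 SM" point; note that %smid (and especially %warpid) are documented as volatile, so treat the values as informational only:

#include <cstdio>

// Read the %smid special register: the SM the calling thread is currently on.
__device__ unsigned my_smid()
{
  unsigned id;
  asm volatile( "mov.u32 %0, %%smid;" : "=r"( id ) );
  return id;
}

__global__ void where_am_i()
{
  // one line per warp is enough
  if ( threadIdx.x % 32 == 0 )
    printf( "block %u, warp %u -> SM %u\n",
            blockIdx.x, threadIdx.x / 32, my_smid() );
}

int main()
{
  where_am_i<<< 1, 128 >>>();   // the (1,128) case from the question
  cudaDeviceSynchronize();
  return 0;
}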

The code is shown below:

#include <cstdlib>   // atoi

// Each thread increments one element. Indexing uses only threadIdx.x,
// so every block touches the same N elements.
__global__ void increment( int *a, int N )
{
  int i = threadIdx.x;
  if ( i < N )
    a[ i ] = a[ i ] + 1;
}

int main( int argc, char *argv[] )
{
  int M = atoi( argv[1] );   // number of blocks
  int N = atoi( argv[2] );   // threads per block (and element count)
  int h_a[ N ];
  for ( int i = 0; i < N; i++ )
    h_a[ i ] = i;

  int *d_a;
  cudaMalloc( (void **) &d_a, N * sizeof(int) );
  cudaMemcpy( d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice );

  dim3 grid_size( M ); dim3 block_size( N );
  increment<<< grid_size, block_size >>>( d_a, N );

  cudaMemcpy( h_a, d_a, N * sizeof(int), cudaMemcpyDeviceToHost );
  cudaFree( d_a );

  return 0;
}

Therefore at most you are using 1024/1536 of the “capacity” of the SM, i.e. 66.67%

OK, I understand that. But as I showed earlier, for (1,128) the theoretical occupancy is also 100%. Based on your calculation, shouldn't the maximum for 128 be 128/1536?

Hi
May I know why <<<1,1536>>> is not correct? I know it exceeds 32 warps per block, but I was trying to match it with your statement about factors of 1536.


$ nv-nsight-cu-cli ./vec_add 1 1536
==PROF== Connected to process 1099724 (/home/mahmood/cuTest/vec_add)
==PROF== Disconnected from process 1099724
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

$ nv-nsight-cu-cli ./vec_add 1 1025
==PROF== Connected to process 1099831 (/home/mahmood/cuTest/vec_add)
==PROF== Disconnected from process 1099831
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

$ nv-nsight-cu-cli ./vec_add 1 1024
==PROF== Connected to process 1099809 (/home/mahmood/cuTest/vec_add)
==PROF== Profiling "increment(int*, int)" - 1: 0%....50%....100% - 8 passes
==PROF== Disconnected from process 1099809
[1099809] vec_add@127.0.0.1
  increment(int*, int), 2021-Oct-11 13:48:16, Context 1, Stream 7
    Section: GPU Speed Of Light
...

I think Robert means “a factor < 1536”, as the maximum block size is 1024 threads.

Correct. CUDA hasn’t changed in that the limit on threads per block is currently still 1024.
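
As an aside, the reason the profiler printed “No kernels were profiled” for 1025 and 1536 above is that those launches fail outright with an invalid-configuration error, and the posted code never checks for launch errors. A minimal sketch of the kind of check that would have made this visible (using a deliberately oversized block):

#include <cstdio>

__global__ void increment( int *a, int N )
{
  int i = threadIdx.x;
  if ( i < N )
    a[ i ] = a[ i ] + 1;
}

int main()
{
  int N = 1536;                  // deliberately above the 1024-thread block limit
  int *d_a;
  cudaMalloc( (void **) &d_a, N * sizeof(int) );

  increment<<< 1, N >>>( d_a, N );

  // An oversized block makes the launch itself fail; nothing runs on the GPU.
  cudaError_t err = cudaGetLastError();
  if ( err != cudaSuccess )
    printf( "launch error: %s\n", cudaGetErrorString( err ) );

  // Errors that occur while the kernel runs would surface at the next sync point.
  err = cudaDeviceSynchronize();
  if ( err != cudaSuccess )
    printf( "execution error: %s\n", cudaGetErrorString( err ) );

  cudaFree( d_a );
  return 0;
}

With N = 1536 this should report an invalid-configuration error instead of failing silently.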

The factor you choose should be less than or equal to 1024 to satisfy this requirement. With that proviso, my suggestion (if you want to witness 100% theoretical occupancy) is to choose a threadblock size that divides evenly into 1536. That is, choose a threadblock size such that:

1536 / (threadblock size) = a whole number

As already stated, the threadblock size you choose cannot (due to other CUDA requirements) exceed 1024.

Furthermore, if you choose a very small threadblock size (say, 32), you may run into other CUDA SM limits, such as the maximum number of resident threadblocks per SM.

To pick some examples: threadblock sizes of 128, 256, and 512 all divide evenly into 1536, so per the previous discussion you should witness 100% theoretical occupancy.

The reason you can witness 100% here is that multiple threadblocks can be resident on an SM at the same time. If you choose 512, for example, the way you get to 100% occupancy is to have 3 of those threadblocks resident on an SM simultaneously.

Choosing a threadblock size of 1024 prevents any other threadblock from becoming resident on the SM, so we never witness more than 1024/1536 utilization of the SM's warp slots.
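
Putting numbers on it for cc8.6 (48 warp slots = 1536 threads per SM, and, from the same programming guide table, at most 16 resident blocks per SM), and assuming this trivial kernel hits no register or shared-memory limit:

theoretical occupancy = min(16, 1536 / block size) × block size / 1536

block size 32: capped at 16 blocks × 32 = 512 threads → 512/1536 ≈ 33%
block size 128: 12 blocks × 128 = 1536 threads → 100%
block size 256: 6 blocks × 256 = 1536 threads → 100%
block size 512: 3 blocks × 512 = 1536 threads → 100%
block size 1024: 1 block × 1024 = 1024 threads → 1024/1536 ≈ 67%, the 66.67% shown in the profiler output above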