grid size estimation with cudaOccupancyMaxActiveBlocksPerMultiprocessor

I am trying to use cudaOccupancyMaxActiveBlocksPerMultiprocessor with the SAXPY example in the code below, using 128 as the thread block size. The API returns 16 on both a K80 (Kepler) and a P100 (Pascal). The number 16 makes sense, since an SM can host at most 2048 threads (2048 / 128 = 16). I then configure the kernel grid size in the following style:
kepler --> 16 * 13 (number of SM in k80)
pascal --> 16 * 56 (number of SM in p100)
Please correct me if I am doing something wrong. I'd like to share experiences with cudaOccupancyMaxActiveBlocksPerMultiprocessor — has anybody used it this way?
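For concreteness, here is a minimal sketch of the setup described above. The device index, the `saxpy<float>` instantiation, and the pointer names `d_a`/`d_b` are assumptions for illustration; the occupancy query and the SM-count multiplication follow the first post.

```cuda
int blockSize = 128;
int numBlocksPerSm = 0;

// Ask the runtime how many blocks of this kernel fit on one SM
// (returns 16 for a 128-thread block on K80 and P100, per the post above).
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocksPerSm, saxpy<float>, blockSize, 0 /* dynamic smem */);

// Multiply by the SM count of the current device:
// 16 * 13 on K80, 16 * 56 on P100.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int gridSize = numBlocksPerSm * prop.multiProcessorCount;

saxpy<<<gridSize, blockSize>>>(d_a, d_b, 2.0f, N);
```

Because the kernel below uses a grid-stride loop, this grid size is correct for any N; whether it is the fastest choice is exactly the question of this thread.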

On the other hand, if I didn't use the API, I would map the grid size to my vector size N: the launch would be <<<65536, 128>>>, and it yields better performance for this example.

// N = 8388608
template <typename T>
__global__ void saxpy(T* a, T* b, T x, int N) {
  /* Grid-stride loop: each block is statically assigned blockIdx.x,
     then strides over the vector in steps of the total thread count. */
  for (int i = threadIdx.x + blockIdx.x * blockDim.x;
       i < N;
       i += blockDim.x * gridDim.x)
    b[i] = x * a[i] + b[i];
}

The grid size should usually be configured to match the amount of work. Ideally it should be at least 10-20x more blocks than can be executed simultaneously on the GPU, in order to lessen the "tail effect". Alternatively, you can run multiple CUDA streams to load multiple jobs onto the GPU, so you can reduce the size of each job.
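The streams alternative mentioned above could look roughly like this. The chunk count, stream array, and device pointer names are illustrative assumptions, not part of the original posts; the idea is just to split the vector so each launch is a smaller job.

```cuda
// Split the N-element vector into nStreams independent chunks and
// launch one saxpy per stream (assumes N is divisible by nStreams).
const int nStreams = 4;
int chunk = N / nStreams;

cudaStream_t streams[nStreams];
for (int s = 0; s < nStreams; ++s)
  cudaStreamCreate(&streams[s]);

for (int s = 0; s < nStreams; ++s) {
  int offset = s * chunk;
  saxpy<<<(chunk + 127) / 128, 128, 0, streams[s]>>>(
      d_a + offset, d_b + offset, 2.0f, chunk);
}

for (int s = 0; s < nStreams; ++s)
  cudaStreamSynchronize(streams[s]);
```

Smaller jobs mean a shorter tail: when one chunk's last blocks are draining, blocks from another stream's launch can fill the idle SMs.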

Yes. However, in my case the amount of work is big. So my question is: if I can already occupy the device using cudaOccupancyMaxActiveBlocksPerMultiprocessor, why do I need to create a bigger grid?

You can run a grid of 16*13 blocks, but in this case the kernel will finish only when the last of these blocks finishes. It's possible that most blocks will finish in, e.g., 1 second, but the last block will finish only after 2 seconds. That's called the "tail effect".