Grid size estimation with cudaOccupancyMaxActiveBlocksPerMultiprocessor

I am trying to use cudaOccupancyMaxActiveBlocksPerMultiprocessor with the SAXPY example in the code below, using a thread block size of 128. The API returns 16 for both a K80 (Kepler) and a P100 (Pascal). The number 16 makes sense, since an SM can host at most 2048 resident threads and 2048 / 128 = 16. I then configure the grid size of the kernel as follows:
Kepler --> 16 * 13 (number of SMs in the K80)
Pascal --> 16 * 56 (number of SMs in the P100)
Please correct me if I am doing something wrong. Has anybody else had experience using cudaOccupancyMaxActiveBlocksPerMultiprocessor? I would like to share my experience with it and compare notes.

On the other hand, if I did not use the API, I would size the grid to match my vector size N. With N = 8388608 my launch would be <<<65536, 128>>>, and that actually yields better performance for this example (see the launch sketch after the kernel below).

// N = 8388608
template <typename T>
__global__ void saxpy(T* a, T* b, T x, int N) {
  // Grid-stride loop: each thread starts at its global index
  // (threadIdx.x + blockIdx.x * blockDim.x) and strides by the
  // total number of threads in the grid (blockDim.x * gridDim.x).
  for (int i = threadIdx.x + blockIdx.x * blockDim.x;
       i < N;
       i += blockDim.x * gridDim.x) {
    b[i] = x * a[i] + b[i];
  }
}
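
For reference, here is a minimal sketch of the two launch configurations I am comparing. It is only a sketch: it assumes the templated saxpy kernel above sits in the same .cu file, float data, an arbitrary x = 2.0f, device 0, no dynamic shared memory, and it omits error checking.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const int N = 8388608;
  const int blockSize = 128;

  float *d_a, *d_b;
  cudaMalloc(&d_a, N * sizeof(float));
  cudaMalloc(&d_b, N * sizeof(float));

  // Occupancy-based grid: resident blocks per SM reported by the API,
  // multiplied by the number of SMs on the device.
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  int blocksPerSM = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy<float>,
                                                blockSize, 0 /* dynamic smem */);
  int occGrid = blocksPerSM * prop.multiProcessorCount;  // 16 * 13 on K80, 16 * 56 on P100
  saxpy<<<occGrid, blockSize>>>(d_a, d_b, 2.0f, N);      // grid-stride loop covers all of N

  // Work-sized grid: one thread per element.
  int workGrid = (N + blockSize - 1) / blockSize;        // 65536 blocks for this N
  saxpy<<<workGrid, blockSize>>>(d_a, d_b, 2.0f, N);

  cudaDeviceSynchronize();
  printf("blocks/SM = %d, SMs = %d, occupancy grid = %d, work grid = %d\n",
         blocksPerSM, prop.multiProcessorCount, occGrid, workGrid);

  cudaFree(d_a);
  cudaFree(d_b);
  return 0;
}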

The grid size should usually be configured to the amount of work you have. Ideally it should be at least 10-20x more blocks than can be resident on the GPU at once, in order to lessen the "tail effect". Alternatively, you can run multiple CUDA streams to feed several jobs to the GPU at the same time, which lets you reduce the size of each individual job.
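
For example, a rough multi-stream sketch (assuming the saxpy kernel from the post above is in the same file, float data, an arbitrary x = 2.0f, N divisible by the number of streams, and no error checking) could split the vector into independent per-stream chunks:

#include <cuda_runtime.h>

int main() {
  const int N = 8388608;
  const int numStreams = 4;          // hypothetical choice; tune for your workload
  const int chunk = N / numStreams;
  const int blockSize = 128;

  float *d_a, *d_b;
  cudaMalloc(&d_a, N * sizeof(float));
  cudaMalloc(&d_b, N * sizeof(float));

  cudaStream_t streams[numStreams];
  for (int s = 0; s < numStreams; ++s)
    cudaStreamCreate(&streams[s]);

  // Each stream gets its own slice of the vector; the launches in different
  // streams are independent, so each individual job stays small while the
  // GPU always has more queued work to pick from.
  for (int s = 0; s < numStreams; ++s) {
    const int offset = s * chunk;
    const int gridSize = (chunk + blockSize - 1) / blockSize;
    saxpy<<<gridSize, blockSize, 0, streams[s]>>>(d_a + offset, d_b + offset,
                                                  2.0f, chunk);
  }
  cudaDeviceSynchronize();

  for (int s = 0; s < numStreams; ++s)
    cudaStreamDestroy(streams[s]);
  cudaFree(d_a);
  cudaFree(d_b);
  return 0;
}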

Yes, but in my case the amount of work is large. So my question is: if I can already occupy the device with the grid size derived from cudaOccupancyMaxActiveBlocksPerMultiprocessor, why do I need to create a bigger grid?

You can run a grid of 16 * 13 blocks, but in that case the kernel finishes only when the last of those blocks finishes. It is possible that most blocks finish in 1 second, for example, while the last block finishes only after 2 seconds, and the SMs that are already done sit idle in the meantime. That is the "tail effect".
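
To put rough numbers on it using the figures from this thread: a grid of exactly 16 * 13 = 208 blocks runs as a single wave, so the whole kernel waits for the slowest block of that one wave. With the work-sized grid of 65536 blocks, the scheduler keeps refilling the SMs for roughly 65536 / 208 ≈ 315 waves, and only the final partial wave can run under-occupied, so the tail becomes a small fraction of the total runtime.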