I am trying to use cudaOccupancyMaxActiveBlocksPerMultiprocessor with the saxpy example in the code below, using 128 as the thread block size. The API returns 16 on both a K80 (Kepler) and a P100 (Pascal). That number makes sense, since an SM can host at most 2048 threads and 2048 / 128 = 16. I then configure the kernel's grid size in the following style:
Kepler → 16 * 13 (number of SMs in a K80)
Pascal → 16 * 56 (number of SMs in a P100)
Please correct me if I am doing something wrong. Has anybody else used cudaOccupancyMaxActiveBlocksPerMultiprocessor? I'd like to compare experiences.
On the other hand, if I didn't use the API, I would map the grid size to my vector size N: the launch would be <<<65536,128>>>, and that actually yields better performance for this example.
// N = 8388608
template <typename T>
__global__ void saxpy(T* a, T* b, T x, int N) {
    /* grid-stride loop: the starting block index comes from blockIdx.x */
    for (int i = threadIdx.x + blockIdx.x * blockDim.x;
         i < N;
         i += blockDim.x * gridDim.x)
    {
        b[i] = x * a[i] + b[i];
    }
}
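For comparison, this is roughly the host-side sizing I'm describing above, as a minimal sketch (my assumptions: T = float, device 0, no error checking, and a 2.0f value for x):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

template <typename T>
__global__ void saxpy(T* a, T* b, T x, int N) {
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < N;
         i += blockDim.x * gridDim.x)
        b[i] = x * a[i] + b[i];
}

int main() {
    const int N = 8388608;
    const int block = 128;

    float *d_a, *d_b;
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));

    // Max resident blocks per SM for this kernel at this block size
    // (last argument is dynamic shared memory per block, none here).
    int numBlocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocksPerSm, saxpy<float>, block, 0);

    // Multiply by the SM count of the device instead of hard-coding 13/56.
    int numSms = 0;
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, 0);

    const int grid = numBlocksPerSm * numSms;  // e.g. 16 * 56 on a P100
    printf("blocks/SM = %d, SMs = %d, grid = %d\n",
           numBlocksPerSm, numSms, grid);

    saxpy<<<grid, block>>>(d_a, d_b, 2.0f, N);
    cudaDeviceSynchronize();

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```

The grid-stride loop in the kernel is what makes the occupancy-sized launch correct even though grid * block is far smaller than N: each thread covers multiple elements.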