CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

Originally published at:

CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, it's important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block…

Nice. That looks quite useful!


How does this work when we want a 2D or even a 3D block?

For now you will need to compute your own 2D/3D block dimensions from the 1D thread counts suggested by the API.
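As a rough illustration of that manual step, here is a minimal host-side sketch. The helper `splitBlockSize2D` and the `BlockDim2D` struct are hypothetical names, not part of the CUDA API, and the choice to pin `x` to the warp size is just one reasonable assumption; other shapes may suit a given kernel better.

```cpp
// Hypothetical helper (not part of the CUDA API): turn the 1D thread count
// suggested by the occupancy API into a 2D block shape.
struct BlockDim2D {
    int x;
    int y;
};

// Assumption: x is pinned to the warp size so each warp covers one row.
// If blockSize is not a multiple of warpSize, the result is rounded down,
// so x * y never exceeds the suggested thread count.
BlockDim2D splitBlockSize2D(int blockSize, int warpSize = 32) {
    if (blockSize < warpSize) {
        // Suggestion smaller than one warp: keep it as a single row.
        return {blockSize, 1};
    }
    return {warpSize, blockSize / warpSize};
}
```

In a CUDA program the result would typically be wrapped as `dim3 block(d.x, d.y);` before the launch. A suggested size of 768, for example, splits into 32 x 24.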

Hello Mark,

This API looks great. I compiled the example you provided above using a CUDA 6.5 install. I also wanted to mention that I got a warning about the type of the kernel parameter.

$ nvcc
/usr/local/cuda-6.5/bin/../targets/x86_64-linux/include/cuda_runtime.h(1394): warning: argument of type "void (*)(int *, int)" is incompatible with parameter of type "const void *"
detected during:
instantiation of "cudaError_t <unnamed>::cudaOccupancyMaxPotentialBlockSizeVariableSMem(int *, int *, T, UnaryFunction, int) [with UnaryFunction=<unnamed>::__cudaOccupancyB2DHelper, T=void (*)(int *, int)]"
(1278): here
instantiation of "cudaError_t <unnamed>::cudaOccupancyMaxPotentialBlockSize(int *, int *, T, size_t, int) [with T=void (*)(int *, int)]"

Nevertheless, the code runs fine. I just wanted to mention it in case someone else runs into this. I should also mention that my compiler is gcc:
$ gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3


Launched blocks of size 768. Theoretical occupancy: 0.000000

GPU - Tesla C2075

Why do I get 0 occupancy when I use cudaSetDevice with the GPU listed above?

What are you using to measure Theoretical occupancy? What are the resources used by your kernel (registers per thread, shared memory per block)?

Hi, very helpful, thanks! However, I have a kernel where the amount of shared memory depends on the block dimensions. What can I do in this case?

There's a C++ version of the API which takes a unary function callback as an argument. You define this function to take a block size and return a dynamic shared memory size in bytes, and the API uses this in its calculations. See
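To make the shape of that callback concrete, here is a minimal sketch of such a unary function, written as plain C++ so it compiles without the CUDA toolkit. The functor name `SmemForBlock` and the sizing rule (one float per thread plus two halo elements) are assumptions made up for illustration; substitute whatever your kernel actually needs.

```cpp
#include <cstddef>

// Hypothetical callback: maps a candidate block size to the number of bytes
// of dynamic shared memory the kernel would need at that block size.
// Assumed sizing rule, for illustration only: one float per thread plus a
// one-element halo on each side.
struct SmemForBlock {
    std::size_t operator()(int blockSize) const {
        return (static_cast<std::size_t>(blockSize) + 2) * sizeof(float);
    }
};
```

Judging by the signature shown in the compiler warning earlier in this thread, an object like this would be passed as the fourth argument of `cudaOccupancyMaxPotentialBlockSizeVariableSMem`, in place of the fixed shared memory size taken by `cudaOccupancyMaxPotentialBlockSize`.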

Is it possible for these values to change at runtime?

float occupancy = (maxActiveBlocks * blockSize / props.warpSize) /
                  (float)(props.maxThreadsPerMultiProcessor /
                          props.warpSize);

Why do we divide by props.warpSize twice? It looks like a redundant operation that can be mathematically simplified to:

occupancy = maxActiveBlocks * blockSize / props.maxThreadsPerMultiProcessor;

Your calculation is semantically different because it ignores integer division. Remember that blockSize might not be a multiple of warpSize (although that's generally not a good idea, it's legal).
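A small host-side sketch may make the difference easier to see. The function names and the input values below (5 active blocks of 48 threads, a warp size of 32, 2048 threads per multiprocessor) are made up for illustration and do not describe any particular GPU or kernel.

```cpp
// Expression as posted in the article: the numerator is truncated to a
// whole number by integer division before the float division happens.
float occupancyAsPosted(int maxActiveBlocks, int blockSize,
                        int warpSize, int maxThreadsPerSM) {
    return (maxActiveBlocks * blockSize / warpSize) /
           (float)(maxThreadsPerSM / warpSize);
}

// Proposed simplification taken literally: every operand is an int, so the
// quotient is truncated (to 0 unless the SM is completely full) before it
// is converted to float.
float occupancySimplifiedInt(int maxActiveBlocks, int blockSize,
                             int maxThreadsPerSM) {
    return maxActiveBlocks * blockSize / maxThreadsPerSM;
}

// Same simplification with a float cast: no truncation at all, which is
// still not the same value as the posted expression when blockSize is not
// a multiple of warpSize.
float occupancySimplifiedFloat(int maxActiveBlocks, int blockSize,
                               int maxThreadsPerSM) {
    return maxActiveBlocks * blockSize / (float)maxThreadsPerSM;
}
```

With those illustrative inputs, the posted expression computes 240 / 32 = 7 (truncated from 7.5) and then 7 / 64 = 0.109375. The all-integer simplification gives 0, and the float-cast simplification gives 240 / 2048 = 0.1171875.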