Originally published at: https://developer.nvidia.com/blog/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/
CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, it's important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block…
Nice. That looks quite useful!
Cooooooool!
How does this work when we use 2D or even 3D blocks?
For now you will need to compute your own 2D/3D block dimensions from the 1D thread counts suggested by the API.
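One possible way to do that conversion, sketched here as a host-side helper (`splitBlockSize2D` is a hypothetical function, not part of the CUDA API), assuming the suggested 1D block size is a multiple of the warp size (32): keep `blockDim.x` warp-aligned and put the remaining factor in `blockDim.y`.

```cpp
#include <utility>

// Hypothetical helper: split a 1D block size suggested by
// cudaOccupancyMaxPotentialBlockSize into 2D block dimensions.
// Keeps x a multiple of the warp size (32) for coalesced access,
// growing it while it still divides the suggested size evenly.
std::pair<int, int> splitBlockSize2D(int blockSize) {
    int x = 32;  // start with one full warp in x
    while (x < 128 && x * 2 <= blockSize && blockSize % (x * 2) == 0) {
        x *= 2;  // cap x at 128 as an arbitrary width limit
    }
    return {x, blockSize / x};  // pass as dim3(x, y) at launch
}
```

For example, a suggested block size of 768 splits into a 128x6 block, and 256 into 128x2; the total thread count per block, and hence the occupancy calculation, is unchanged.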
Hello Mark,
This API looks great. I compiled the example you provided above using a CUDA 6.5 installation. I also wanted to mention that I got a warning concerning the method signature for the kernel parameter.
$ nvcc example_occupancy.cu
/usr/local/cuda-6.5/bin/../targets/x86_64-linux/include/cuda_runtime.h(1394): warning: argument of type "void (*)(int *, int)" is incompatible with parameter of type "const void *"
detected during:
instantiation of "cudaError_t <unnamed>::cudaOccupancyMaxPotentialBlockSizeVariableSMem(int *, int *, T, UnaryFunction, int) [with UnaryFunction=<unnamed>::__cudaOccupancyB2DHelper, T=void (*)(int *, int)]"
(1278): here
instantiation of "cudaError_t <unnamed>::cudaOccupancyMaxPotentialBlockSize(int *, int *, T, size_t, int) [with T=void (*)(int *, int)]"
example_occupancy.cu(19):
Nevertheless, the code runs fine. I just wanted to mention it in case someone else experiences this. I should also note that my compiler is gcc:
$ gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
Cheers,
Launched blocks of size 768. Theoretical occupancy: 0.000000
GPU - Tesla C2075
Why do I get 0 occupancy when I use cudaSetDevice with the GPU above?
What are you using to measure Theoretical occupancy? What are the resources used by your kernel (registers per thread, shared memory per block)?
Hi, very helpful, thanks! However, I have a kernel where the amount of shared memory depends on the block dimensions. What can I do in this case?
There's a C++ version of the API which takes a unary function callback as an argument. You define this function to take a block size and return a dynamic shared memory size in bytes, and the API uses this in its calculations. See http://docs.nvidia.com/cuda...
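A sketch of how that callback version might be used. The kernel body and the shared-memory relationship (one float per thread) are assumptions for illustration; substitute your kernel's real mapping from block size to dynamic shared memory, and add error checking as needed.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *data) {
    extern __shared__ float smem[];  // dynamic shared memory
    smem[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    // ... kernel body elided ...
}

// Maps a candidate block size to the dynamic shared memory this
// kernel would need for it: here, one float per thread (an
// assumption for illustration).
struct SMemForBlock {
    size_t operator()(int blockSize) const {
        return blockSize * sizeof(float);
    }
};

int main() {
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSizeVariableSMem(
        &minGridSize, &blockSize, myKernel, SMemForBlock());
    printf("Suggested block size: %d (min grid size %d)\n",
           blockSize, minGridSize);
    return 0;
}
```

The API calls your functor (or lambda) for each candidate block size it considers, so the shared-memory cost is accounted for in the occupancy calculation automatically.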
Is it possible for these values to change at runtime?
float occupancy = (maxActiveBlocks * blockSize / props.warpSize) /
(float)(props.maxThreadsPerMultiProcessor /
props.warpSize);
Why do we divide twice by props.warpSize? It's a redundant operation that can be mathematically simplified:
occupancy = maxActiveBlocks * blockSize / props.maxThreadsPerMultiProcessor;
Your calculation is semantically different because it ignores integer division. Remember that blockSize might not be a multiple of warpSize (although that's generally not a good idea, it's legal).