I am writing a simple function that uses the occupancy calculation functions to maximize the occupancy and dynamic shared memory size (in that order of priority) of a cooperative kernel launch, but I ran into some interesting issues. I wrote the following MRE:
#include "helper_cuda.h"

__global__ void foo() {}

void foo_set_attributes() {
  int value;
  checkCudaErrors(cudaDeviceGetAttribute(
      &value, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0));
  fprintf(stderr, "maxSharedMemoryPerBlockOptin = %d\n", value);
  checkCudaErrors(cudaFuncSetAttribute(
      foo, cudaFuncAttributeMaxDynamicSharedMemorySize, value));
}

void foo_print_occupancy() {
  size_t dynamicSmemSize = 0;
  // Do the following twice: once assuming no dynamic shared memory usage, and
  // once assuming maximum dynamic shared memory usage.
  for (size_t i = 0; i < 2; ++i) {
    int minGridSize, blockSize, numBlocks;
    // Get the suggested grid / block size pair that achieves the best potential
    // occupancy (i.e. the maximum number of active warps with the smallest
    // number of blocks).
    checkCudaErrors(cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                                       foo, dynamicSmemSize));
    // Get the maximum number of active blocks per streaming multiprocessor
    // for the device function.
    checkCudaErrors(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, foo, blockSize, dynamicSmemSize));
    // Get the maximum size of dynamic shared memory that still allows
    // `numBlocks` blocks per SM.
    checkCudaErrors(cudaOccupancyAvailableDynamicSMemPerBlock(
        &dynamicSmemSize, foo, numBlocks, blockSize));
    fprintf(stderr,
            "minGridSize = %d\nblockSize = %d\nnumBlocks = %d\ndynamicSmemSize "
            "= %zu\n",
            minGridSize, blockSize, numBlocks, dynamicSmemSize);
    // Try to launch a cooperative kernel.
    cudaLaunchCooperativeKernel((const void *)foo, minGridSize, blockSize,
                                nullptr, dynamicSmemSize);
    printLastCudaError("cooperative kernel launch");
  }
}

int main() {
  foo_set_attributes();
  foo_print_occupancy();
}
On my machine (RTX 3090 rig running Windows 11; tested with CUDA Toolkit 12.0.1 AND 12.8.1), this program prints the following:
maxSharedMemoryPerBlockOptin = 101376
minGridSize = 164
blockSize = 768
numBlocks = 2
dynamicSmemSize = 51200
./main.cu(44) : getLastCudaError() CUDA error : cooperative kernel launch : (720) too many blocks in cooperative launch.
minGridSize = 82
blockSize = 1024
numBlocks = 1
dynamicSmemSize = 101376
This is… strange, to say the least. cudaDeviceGetAttribute
correctly reports that a maximum of 101,376 bytes (99 KB) of shared memory may be used per block; this is the expected value for devices of Compute Capability 8.6. However, cudaOccupancyAvailableDynamicSMemPerBlock
seems to be returning a slightly larger value than it should. This causes the maximum number of active thread blocks per SM to drop from 2 to 1 on the second iteration, cutting occupancy roughly in half.
Curiously, the returned value (51,200 bytes, or 50 KB) is exactly half the amount of shared memory per SM (and remember, there are supposed to be 2 active thread blocks per SM). As of CC 8.0, "1 KB of shared memory [per SM] … is reserved for system use", and it seems that cudaOccupancyAvailableDynamicSMemPerBlock
does not take this into account. Note that I tried compiling for my GPU's native arch ("--generate-code=arch=compute_86,code=[sm_86]"
), so it's not an issue of CC compilation differences.
I'm posting this here in case I missed something, but otherwise this should definitely be fixed. I've already filed a bug report.