cudaOccupancyAvailableDynamicSMemPerBlock returning incorrect value

I am writing a simple function that uses the occupancy calculation functions to maximize the occupancy and dynamic shared memory size (in that order of priority) of a cooperative kernel launch, but I ran into some interesting issues. I wrote the following MRE:

#include <cstdio>

#include "helper_cuda.h"

__global__ void foo() {}

void foo_set_attributes() {
  int value;
  checkCudaErrors(cudaDeviceGetAttribute(
      &value, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0));
  fprintf(stderr, "maxSharedMemoryPerBlockOptin = %d\n", value);
  checkCudaErrors(cudaFuncSetAttribute(
      foo, cudaFuncAttributeMaxDynamicSharedMemorySize, value));
}

void foo_print_occupancy() {
  size_t dynamicSmemSize = 0;

  // Do the following twice: once assuming no dynamic shared memory usage, and
  // once assuming maximum dynamic shared memory usage.
  for (size_t i = 0; i < 2; ++i) {
    int minGridSize, blockSize, numBlocks;

    // Get the suggested grid / block size pair that achieves the best potential
    // occupancy (i.e. the maximum number of active warps with the smallest
    // number of blocks).
    checkCudaErrors(cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                                       foo, dynamicSmemSize));
    // Get the maximum number of active blocks per streaming multiprocessor
    // for the device function.
    checkCudaErrors(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, foo, blockSize, dynamicSmemSize));
    // Get the maximum size of dynamic shared memory to allow `numBlocks` blocks
    // per SM
    checkCudaErrors(cudaOccupancyAvailableDynamicSMemPerBlock(
        &dynamicSmemSize, foo, numBlocks, blockSize));

    fprintf(stderr,
            "minGridSize = %d\nblockSize = %d\nnumBlocks = %d\ndynamicSmemSize "
            "= %zu\n",
            minGridSize, blockSize, numBlocks, dynamicSmemSize);

    // Try to launch a cooperative kernel
    cudaLaunchCooperativeKernel((const void *)foo, minGridSize, blockSize,
                                nullptr, dynamicSmemSize);
    printLastCudaError("cooperative kernel launch");
  }
}

int main() {
  foo_set_attributes();
  foo_print_occupancy();
}

On my machine (an RTX 3090 rig running Windows 11; tested with both CUDA Toolkit 12.0.1 and 12.8.1), this program prints the following:

maxSharedMemoryPerBlockOptin = 101376
minGridSize = 164
blockSize = 768
numBlocks = 2
dynamicSmemSize = 51200
./main.cu(44) : getLastCudaError() CUDA error : cooperative kernel launch : (720) too many blocks in cooperative launch.
minGridSize = 82
blockSize = 1024
numBlocks = 1
dynamicSmemSize = 101376

This is strange, to say the least. cudaDeviceGetAttribute correctly reports that a maximum of 101,376 bytes (99 KB) of shared memory may be used per block; this is the expected value for devices of Compute Capability 8.6. However, cudaOccupancyAvailableDynamicSMemPerBlock seems to be returning a slightly larger value than it should. This causes the maximum number of active thread blocks per SM to decrease from 2 to 1 and cuts occupancy by approximately half.

Curiously, the returned value (51,200 bytes, or 50 KB) is exactly half the amount of shared memory per SM (and remember, there are supposed to be 2 active thread blocks per SM). As of CC 8.0, “1 KB of shared memory [per SM] is reserved for system use”, and it seems that cudaOccupancyAvailableDynamicSMemPerBlock does not take this into account. Note that I tried compiling for my GPU’s native arch ("--generate-code=arch=compute_86,code=[sm_86]"), so it’s not an issue of CC compilation differences.

I’m posting this here in case I missed something, but otherwise this should definitely be fixed. I’ve already filed a bug report.

One can argue about this being an unexpected value/sharp edge instead of an incorrect one/bug given that the user can query and subtract the per-block reserved memory, but at the very least the documentation should be clear about what this function returns.

First, I think it’s important to note that there isn’t always* such a thing as “per-block reserved memory”, only per-SM reserved memory. (Update: I was incorrect about there being “reserved shared memory per SM”; there is indeed only “reserved shared memory per block”.) Second, maybe I missed them, but I was unable to find an API for querying either per-block or per-SM reserved memory. Could you point me in that direction?

*Testing on my RTX 3090 indicated that there is no per-block reserved memory with CC 8.6, but further testing on an H100 appears to indicate that there is with CC 9.0. I’ll write a follow-up on this soon.

cudaGetDeviceProperties() gives you access to reservedSharedMemPerBlock. It being zero for certain architectures/GPUs would not be a problem.

I just saw that this entry in the device properties is missing from the struct declaration in the documentation and only listed below that, i.e. another documentation “bug”.

Upon further consideration, this is a documentation bug at the very least. The documentation for cudaOccupancyAvailableDynamicSMemPerBlock states that it “returns in *dynamicSmemSize the maximum size of dynamic shared memory to allow numBlocks blocks per SM”, yet actually using dynamicSmemSize bytes of dynamic shared memory per thread block does not allow numBlocks blocks per SM.

Furthermore, I still contend that this is a proper bug, as cudaDevAttrMaxSharedMemoryPerBlockOptin correctly accounts for reserved shared memory, and cudaDevAttrReservedSharedMemoryPerBlock alone is not sufficient for computing the actual maximum size of dynamic shared memory per block. It’s also worth noting that cudaDevAttrMaxSharedMemoryPerMultiprocessor correctly does not account for reserved shared memory, thereby reinforcing the idea that values relating to “maximum amount of shared memory per block” should account for reserved shared memory.

I’d be curious to see if cudaFuncSetCacheConfig/cudaFuncAttributePreferredSharedMemoryCarveout changes the return value of cudaOccupancyAvailableDynamicSMemPerBlock on Kepler GPUs, or what the return value is if it doesn’t. Does it return the correct value according to the preferred carveout, or does it always return the value of cudaDevAttrMaxSharedMemoryPerMultiprocessor? Unfortunately, I don’t have a Kepler GPU to experiment with.

Volta and newer GPUs also have unified shared memory and L1 (+ Texture Cache), so you don’t need ancient Fermi/Kepler to check that. See e.g. 1. NVIDIA Ampere GPU Architecture Tuning Guide — NVIDIA Ampere Tuning Guide 12.8 documentation for your 3090.

It’s also worth noting that it isn’t as simple as “manually subtracting reservedSharedMemPerBlock from the output of cudaOccupancyAvailableDynamicSMemPerBlock()”; you also have to account for the possibility that the returned value is already less than or equal to the “true” maximum. This is the simplest rendition I can think of:

int device;
cudaGetDevice(&device);

int sharedMemPerMultiprocessor, reservedSharedMemPerBlock;
cudaDeviceGetAttribute(&sharedMemPerMultiprocessor, cudaDevAttrMaxSharedMemoryPerMultiprocessor, device);
cudaDeviceGetAttribute(&reservedSharedMemPerBlock, cudaDevAttrReservedSharedMemoryPerBlock, device);

size_t dynamicSmemSize;
cudaOccupancyAvailableDynamicSMemPerBlock(&dynamicSmemSize, func, numBlocks, blockSize);

dynamicSmemSize = std::min(dynamicSmemSize, (size_t)(sharedMemPerMultiprocessor / numBlocks - reservedSharedMemPerBlock));

And at this point, I’m not sure why you would use cudaOccupancyAvailableDynamicSMemPerBlock, as this is exactly equivalent:

int device;
cudaGetDevice(&device);

int sharedMemPerMultiprocessor, reservedSharedMemPerBlock;
cudaDeviceGetAttribute(&sharedMemPerMultiprocessor, cudaDevAttrMaxSharedMemoryPerMultiprocessor, device);
cudaDeviceGetAttribute(&reservedSharedMemPerBlock, cudaDevAttrReservedSharedMemoryPerBlock, device);

size_t dynamicSmemSize;
{
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, func);
  dynamicSmemSize = attr.maxDynamicSharedSizeBytes;
}
dynamicSmemSize = std::min(dynamicSmemSize, (size_t)(sharedMemPerMultiprocessor / numBlocks - reservedSharedMemPerBlock));

It just feels like the user is doing most of the heavy lifting here, and it doesn’t feel like intended behavior.

If func’s cudaFuncAttributeMaxDynamicSharedMemorySize is not set close to the value of cudaDevAttrMaxSharedMemoryPerBlockOptin (e.g. the default is 48KB) and/or numBlocks == 1, you’d be subtracting 1KB of shared memory for no reason.

Honestly, after playing around with the code and reading the docs again, I’m not quite sure what I got hung up on earlier. This is clearly a functional bug. The docs and even the API name contain “dynamic shared memory” and there is no reason to believe reserved shared memory is part of that. Have you filed it as a bug already?

Another interesting property of the bug is that once you use any static shared memory in the kernel (in a way that can’t be optimized away by the compiler), the problem seems to vanish. I.e. cudaOccupancyAvailableDynamicSMemPerBlock() does respect reserved shared memory, but only if there is also static shared memory to be considered.

Hi, this responds to NVBUG ID 5209234, which we are tracking with the CUDA engineering team.

You are correct that what happens here is “Note that the maximum amount of shared memory per thread block is smaller than the maximum shared memory partition available per SM. The 1 KB of shared memory not made available to a thread block is reserved for system use.” See 1. Introduction — CUDA C++ Programming Guide.

And it looks like ‘cudaDevAttrMaxSharedMemoryPerBlockOptin’ reports the value with reserved shared memory deducted, but ‘cudaOccupancyAvailableDynamicSMemPerBlock’ still reports the original.

We will follow up with the engineering team and report the conclusion back here.
