cudaOccupancyAvailableDynamicSMemPerBlock returning incorrect value

I am writing a simple function that uses the occupancy calculation functions to maximize the occupancy and dynamic shared memory size (in that order of priority) of a cooperative kernel launch, but I ran into some interesting issues. I wrote the following MRE:

#include <cstdio>

#include "helper_cuda.h"

__global__ void foo() {}

void foo_set_attributes() {
  int value;
  checkCudaErrors(cudaDeviceGetAttribute(
      &value, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0));
  fprintf(stderr, "maxSharedMemoryPerBlockOptin = %d\n", value);
  checkCudaErrors(cudaFuncSetAttribute(
      foo, cudaFuncAttributeMaxDynamicSharedMemorySize, value));
}

void foo_print_occupancy() {
  size_t dynamicSmemSize = 0;

  // Do the following twice: once assuming no dynamic shared memory usage, and
  // once assuming maximum dynamic shared memory usage.
  for (size_t i = 0; i < 2; ++i) {
    int minGridSize, blockSize, numBlocks;

    // Get the suggested grid / block size pair that achieves the best potential
    // occupancy (i.e. the maximum number of active warps with the smallest
    // number of blocks).
    checkCudaErrors(cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                                       foo, dynamicSmemSize));
    // Get the maximum number of active blocks per streaming multiprocessor
    // for the device function.
    checkCudaErrors(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, foo, blockSize, dynamicSmemSize));
    // Get the maximum size of dynamic shared memory to allow `numBlocks` blocks
    // per SM
    checkCudaErrors(cudaOccupancyAvailableDynamicSMemPerBlock(
        &dynamicSmemSize, foo, numBlocks, blockSize));

    fprintf(stderr,
            "minGridSize = %d\nblockSize = %d\nnumBlocks = %d\ndynamicSmemSize "
            "= %zu\n",
            minGridSize, blockSize, numBlocks, dynamicSmemSize);

    // Try to launch a cooperative kernel
    cudaLaunchCooperativeKernel((const void *)foo, minGridSize, blockSize,
                                nullptr, dynamicSmemSize);
    printLastCudaError("cooperative kernel launch");
  }
}

int main() {
  foo_set_attributes();
  foo_print_occupancy();
}

On my machine (an RTX 3090 rig running Windows 11; tested with both CUDA Toolkit 12.0.1 and 12.8.1), this program prints the following:

maxSharedMemoryPerBlockOptin = 101376
minGridSize = 164
blockSize = 768
numBlocks = 2
dynamicSmemSize = 51200
./main.cu(44) : getLastCudaError() CUDA error : cooperative kernel launch : (720) too many blocks in cooperative launch.
minGridSize = 82
blockSize = 1024
numBlocks = 1
dynamicSmemSize = 101376

This is strange, to say the least. cudaDeviceGetAttribute correctly reports that a maximum of 101,376 bytes (99 KB) of shared memory may be used per block; this is the expected value for devices of Compute Capability 8.6. However, cudaOccupancyAvailableDynamicSMemPerBlock seems to be returning a slightly larger value than it should. This causes the maximum number of active thread blocks per SM to decrease from 2 to 1 and cuts occupancy by approximately half.

Curiously, the returned value (51,200 bytes, or 50 KB) is exactly half the amount of shared memory per SM (and remember, there are supposed to be 2 active thread blocks per SM). As of CC 8.0, “1 KB of shared memory [per SM] is reserved for system use”, and it seems that cudaOccupancyAvailableDynamicSMemPerBlock does not take this into account. Note that I tried compiling for my GPU’s native arch ("--generate-code=arch=compute_86,code=[sm_86]"), so it’s not an issue of CC compilation differences.

I’m posting this here in case I missed something, but otherwise this should definitely be fixed. I’ve already filed a bug report.

One can argue about this being an unexpected value/sharp edge instead of an incorrect one/bug given that the user can query and subtract the per-block reserved memory, but at the very least the documentation should be clear about what this function returns.

First, I think it’s important to note that there isn’t always* such a thing as “per-block reserved memory”, only per-SM reserved memory. (Update: I was incorrect about there being “reserved shared memory per SM”; there is indeed only “reserved shared memory per block”.) Second, maybe I missed them, but I was unable to find an API for querying either per-block or per-SM reserved memory. Could you point me in that direction?

*Testing on my RTX 3090 indicated that there is no per-block reserved memory with CC 8.6, but further testing on an H100 appears to indicate that there is with CC 9.0. I’ll write a follow-up on this soon.

cudaGetDeviceProperties() gives you access to reservedSharedMemPerBlock. It being zero for certain architectures/GPUs would not be a problem.

I just saw that this entry in the device properties is missing from the struct declaration in the documentation and only listed below that, i.e. another documentation “bug”.

Upon further consideration, this is a documentation bug at the very least. The documentation for cudaOccupancyAvailableDynamicSMemPerBlock states that it “returns in *dynamicSmemSize the maximum size of dynamic shared memory to allow numBlocks blocks per SM”, yet actually using dynamicSmemSize bytes of dynamic shared memory per thread block does not allow numBlocks blocks per SM.

Furthermore, I still contend that this is a proper bug, as cudaDevAttrMaxSharedMemoryPerBlockOptin correctly accounts for reserved shared memory, and cudaDevAttrReservedSharedMemoryPerBlock alone is not sufficient for computing the actual maximum size of dynamic shared memory per block. It’s also worth noting that cudaDevAttrMaxSharedMemoryPerMultiprocessor correctly does not account for reserved shared memory, thereby reinforcing the idea that values relating to “maximum amount of shared memory per block” should account for reserved shared memory.

I’d be curious to see if cudaFuncSetCacheConfig/cudaFuncAttributePreferredSharedMemoryCarveout changes the return value of cudaOccupancyAvailableDynamicSMemPerBlock on Kepler GPUs, or what the return value is if it doesn’t. Does it return the correct value according to the preferred carveout, or does it always return the value of cudaDevAttrMaxSharedMemoryPerMultiprocessor? Unfortunately, I don’t have a Kepler GPU to experiment with.

Volta and newer GPUs also have unified shared memory and L1 (+ Texture Cache), so you don’t need ancient Fermi/Kepler to check that. See e.g. 1. NVIDIA Ampere GPU Architecture Tuning Guide — NVIDIA Ampere Tuning Guide 12.8 documentation for your 3090.

It’s also worth noting that it isn’t as simple as “manually subtracting reservedSharedMemPerBlock from the output of cudaOccupancyAvailableDynamicSMemPerBlock()”; you also have to account for the possibility that the returned value is already less than or equal to the “true” maximum. This is the simplest rendition I can think of:

int device;
cudaGetDevice(&device);

int sharedMemPerMultiprocessor, reservedSharedMemPerBlock;
cudaDeviceGetAttribute(&sharedMemPerMultiprocessor, cudaDevAttrMaxSharedMemoryPerMultiprocessor, device);
cudaDeviceGetAttribute(&reservedSharedMemPerBlock, cudaDevAttrReservedSharedMemoryPerBlock, device);

size_t dynamicSmemSize;
cudaOccupancyAvailableDynamicSMemPerBlock(&dynamicSmemSize, func, numBlocks, blockSize);

dynamicSmemSize = std::min(dynamicSmemSize, (size_t)(sharedMemPerMultiprocessor / numBlocks - reservedSharedMemPerBlock));

And at this point, I’m not sure why you would use cudaOccupancyAvailableDynamicSMemPerBlock, as this is exactly equivalent:

int device;
cudaGetDevice(&device);

int sharedMemPerMultiprocessor, reservedSharedMemPerBlock;
cudaDeviceGetAttribute(&sharedMemPerMultiprocessor, cudaDevAttrMaxSharedMemoryPerMultiprocessor, device);
cudaDeviceGetAttribute(&reservedSharedMemPerBlock, cudaDevAttrReservedSharedMemoryPerBlock, device);

size_t dynamicSmemSize;
{
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, func);
  dynamicSmemSize = attr.maxDynamicSharedSizeBytes;
}
dynamicSmemSize = std::min(dynamicSmemSize, (size_t)(sharedMemPerMultiprocessor / numBlocks - reservedSharedMemPerBlock));

It just feels like the user is doing most of the heavy lifting here, and it doesn’t feel like intended behavior.

If func’s cudaFuncAttributeMaxDynamicSharedMemorySize is not set close to the value of cudaDevAttrMaxSharedMemoryPerBlockOptin (e.g. the default is 48KB) and/or numBlocks == 1, you’d be subtracting 1KB of shared memory for no reason.

Honestly, after playing around with the code and reading the docs again, I’m not quite sure what I got hung up on earlier. This is clearly a functional bug. The docs and even the API name contain “dynamic shared memory” and there is no reason to believe reserved shared memory is part of that. Have you filed it as a bug already?

Another interesting property of the bug is that once you use any static shared memory in the kernel (in a way that can’t be optimized away by the compiler), the problem seems to vanish. I.e. cudaOccupancyAvailableDynamicSMemPerBlock() does respect reserved shared memory, but only if there is also static shared memory to be considered.

Hi, this responds to NVBUG ID 5209234, which we are tracking with the CUDA engineering team.

You are correct that what happens here is “Note that the maximum amount of shared memory per thread block is smaller than the maximum shared memory partition available per SM. The 1 KB of shared memory not made available to a thread block is reserved for system use.” See 1. Introduction — CUDA C++ Programming Guide.

And it looks like ‘cudaDevAttrMaxSharedMemoryPerBlockOptin’ reports the value with reserved shared memory deducted, but ‘cudaOccupancyAvailableDynamicSMemPerBlock’ still reports the original.

We will follow up with the engineering team and report the conclusion back here.
