Why is the shared memory configuration size limiting the occupancy?

BAdhi · June 3, 2023, 4:11am

I have a kernel compiled for sm_90 that runs on an H100 GPU. Its launch statistics follow. It consumes 18.82 KB of shared memory per block. Since the H100 has 228 KB of shared memory per SM, considering shared memory alone, 12 blocks should be able to reside on an SM (228 / 18.82 ≈ 12.1).

But the profiler says only 5 blocks can reside on an SM. This could be due to the Shared Memory Configuration Size, which is set to 102.4 KB (102.4 / 18.82 ≈ 5.4). How is the 102.4 KB value set? If I could increase that limit, I should be able to run with a bigger grid size.
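
The two estimates above can be reproduced with a quick calculation. This is only a sketch: `blocks_by_smem` is a hypothetical helper, not a CUDA API, and it looks at shared memory capacity alone, ignoring registers, thread limits, and any per-block overhead.

```cpp
// Hypothetical helper (not part of CUDA): how many blocks fit on one SM
// if shared memory capacity were the only limiter.
constexpr int blocks_by_smem(double capacity_kb, double per_block_kb) {
    // Truncate toward zero: a partially fitting block cannot be resident.
    return static_cast<int>(capacity_kb / per_block_kb);
}

// Figures taken from the post above.
constexpr int with_full_sm = blocks_by_smem(228.0, 18.82);  // 228 KB per SM
constexpr int with_config  = blocks_by_smem(102.4, 18.82);  // 102.4 KB configuration size
```

With these inputs `with_full_sm` is 12 and `with_config` is 5, which matches the expected and the reported block counts, so the configuration size does look like the active limit.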

njuffa · June 3, 2023, 4:32am

Have you consulted section “16.8.3. Shared Memory” of the CUDA C++ Programming Guide on how to configure shared memory on Hopper architecture parts?

BAdhi · June 4, 2023, 9:57pm

Hi @njuffa ,

I have set the preferred carveout value to 100, as mentioned in the reference you provided, but it still doesn’t utilize the maximum available shared memory. Did you mean some other configuration?

I have the following small example, which limits the occupancy to 4 on H100 while it could go up to 32:

#include <cstdio>   // fprintf
#include <cstdlib>  // exit, EXIT_FAILURE
#include <iostream>
#include <vector>
#include <cooperative_groups.h>
#include <cuda/barrier>

#define CHECK_CUDA(call)                                                  \
    {                                                                     \
        cudaError_t err = call;                                           \
        if (cudaSuccess != err) {                                         \
            fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err));         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    }


using namespace cooperative_groups;
__global__ void test() {
    auto grid = this_grid();
    grid.sync();
    grid.sync(); 
}

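int main(int v, char** val) {
    int numThreads = 64;  // threads per block used for the occupancy query
    int occupancy = 0;
    // Ask for the largest shared memory carveout (value is a percentage).
    CHECK_CUDA(cudaFuncSetAttribute(test, cudaFuncAttributePreferredSharedMemoryCarveout, 100));
    int maxbytes = 98304; // 96 KB
    // Allow the kernel to request up to maxbytes of dynamic shared memory.
    CHECK_CUDA(cudaFuncSetAttribute(test, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes));
    // Last argument: dynamic shared memory per block in bytes (0 here).
    CHECK_CUDA(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &occupancy, (void *)test, numThreads, 0));
    std::cout << " occupancy : " << occupancy << std::endl;
}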

On H100 the output is 4, while on A100 it is 32.
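
One thing worth checking when comparing these numbers is the unit. A minimal sketch, assuming the profiler reports decimal kilobytes (1 KB = 1000 bytes) while the code above counts in binary kilobytes (98304 bytes = 96 × 1024). That assumption is not confirmed anywhere in this thread, and both helper names are hypothetical.

```cpp
// Hypothetical unit helpers (not part of CUDA).
constexpr long kib_to_bytes(long kib) { return kib * 1024; }        // binary: 1 KiB = 1024 bytes
constexpr double bytes_to_kb(long bytes) { return bytes / 1000.0; } // decimal: 1 KB = 1000 bytes

// The 96 KB limit used in the example, expressed in bytes.
constexpr long maxbytes_check = kib_to_bytes(96);             // 98304

// 100 KiB expressed in decimal KB, for comparison with the profiler value.
constexpr double config_kb = bytes_to_kb(kib_to_bytes(100));  // 102.4
```

Under that assumption, the 102.4 KB reported as the Shared Memory Configuration Size would be exactly 100 KiB, so figures taken from the profiler and figures written in code should be converted to one unit before they are compared.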
