Why is shared memory configuration size is limiting the occupancy

I have a kernel that was compiled for sm_90 that runs on H100 Gpu. Following is its launch statistics. It consumes 18.82 KB of shared memory. Since H100 has 228KB of shared memory per SM, technically considering the shared memory alone it should be able to reside 12 blocks

But as per the profiler, It says only 5 blocks can reside in a SM. This could be due to the Shared Memory Configuration Size which is set as 102.4KB. How is the 102.4KB is set? If I could increase that limit I should be able to run with a bigger grid size

Have you consulted section “16.8.3. Shared Memory” of the CUDA C++ Programming Guide on how to configure shared memory on Hopper architecture parts?

Hi @njuffa ,

I have set the preferred carveout value too 100 as mentioned in the reference you have provided, but still it doesn’t utilize the max available shared memory. Did you mean to say some other configuration?

I have the following small example which limits the occupancy to 4 in h100 while it could go for 32

#include <iostream>
#include <vector>
#include <cooperative_groups.h>
#include <cuda/barrier>

#define CHECK_CUDA(call)                                                  \
    {                                                                     \
        cudaError_t err = call;                                           \
        if (cudaSuccess != err) {                                         \
            fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err));         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    }


using namespace cooperative_groups;
__global__ void test() {
    auto grid = this_grid();
    grid.sync();
    grid.sync(); 
}

int main(int v, char** val) {
    int numThreads = 64;
    int occupancy = 0;
    CHECK_CUDA(cudaFuncSetAttribute(test, cudaFuncAttributePreferredSharedMemoryCarveout, 100));
    int maxbytes = 98304; // 96 KB
    CHECK_CUDA(cudaFuncSetAttribute(test, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes));
    CHECK_CUDA(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &occupancy, (void *)test, numThreads, 0));
    std::cout << " occupancy : " << occupancy << std::endl;
}

For H100 the output is 4 while for A100 it is 32