Default value of max dynamic shared memory

I’m working on CUDA 11.2 and Compute Capability 8.6. I have a kernel which might require more than 48KB of dynamic shared memory, so I wrote

int dynamic_smem_size = // ... some calculation logic
if (dynamic_smem_size >= (48 << 10)) {
    cudaError_t result = cudaFuncSetAttribute(
        MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
        dynamic_smem_size);
}
// launch the kernel..

but launching the kernel gives me cudaErrorInvalidValue when dynamic_smem_size is 48964. Then I tried

  cudaFuncAttributes attrs;
  cudaFuncGetAttributes(&attrs, MyKernel);
  LOG(INFO) << attrs.maxDynamicSharedSizeBytes;

the result is 48724. It looks like the default value of cudaFuncAttributeMaxDynamicSharedMemorySize for my kernel is not exactly 48 << 10 bytes. Why is that?

I suggest providing a short, complete test case. If you are also using statically allocated shared memory, that will probably be at least part of the explanation.

FWIW, 48964 is not larger than or equal to 48 << 10 (49152), so the condition in your first code snippet is false for that case and cudaFuncSetAttribute is never called.

And with respect to the 2nd code snippet: if you don't request anything higher (and it seems to me you haven't), the value you get is 48K (i.e. 49152) reduced by some reservation made by the CUDA runtime for its own use.
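
For illustration, here is a minimal sketch of that opt-in, reusing MyKernel and dynamic_smem_size from your first snippet (grid, block, and the kernel arguments are placeholders): rather than comparing against a hard-coded 48 << 10, compare against the kernel's current limit and check the return codes.

// Sketch only: grid, block, and kernel arguments are placeholders.
cudaFuncAttributes attrs;
cudaError_t err = cudaFuncGetAttributes(&attrs, MyKernel);
if (err != cudaSuccess) { /* handle error */ }

if (dynamic_smem_size > attrs.maxDynamicSharedSizeBytes) {
  // Opt in to more than the default limit; this fails if the request
  // exceeds the device's per-block opt-in maximum.
  err = cudaFuncSetAttribute(MyKernel,
                             cudaFuncAttributeMaxDynamicSharedMemorySize,
                             dynamic_smem_size);
  if (err != cudaSuccess) { /* handle error */ }
}
MyKernel<<<grid, block, dynamic_smem_size>>>(/* kernel args */);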

Yes, I also used 392 bytes of static shared memory. But 49152 - 392 = 48760 is still 36 bytes larger than 48724.

What is this reservation size?

For 8.x, it’s:

“Note that the maximum amount of shared memory per thread block is smaller than the maximum shared memory partition available per SM. The 1 KB of shared memory not made available to a thread block is reserved for system use.”

@rs277 I think this is irrelevant to my problem. It just explains why the maximum amount of shared memory per thread block (99 KB on sm_86) is 1 KB smaller than the maximum shared memory partition available per SM (100 KB on sm_86).
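
For reference, both limits can be queried at runtime. A minimal sketch, assuming device 0; the values in the comments are what I would expect on sm_86 given the numbers above:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
  int per_block_optin = 0, per_sm = 0;
  // Maximum shared memory a single block can opt in to, vs. the per-SM partition.
  cudaDeviceGetAttribute(&per_block_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
  cudaDeviceGetAttribute(&per_sm, cudaDevAttrMaxSharedMemoryPerMultiprocessor, 0);
  printf("max shared per block (opt-in): %d\n", per_block_optin); // 101376 (99 KB) on sm_86
  printf("max shared per SM:             %d\n", per_sm);          // 102400 (100 KB) on sm_86
  return 0;
}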

To avoid further speculation as to what is relevant to your problem, consider posting a minimal self-contained reproducer code that others can build and run.

This is a simple test:

#include <stdio.h>
#include <cuda_runtime.h>

template<typename T, int static_smem_size>
__global__ void MyKernel(T* dst, const T* src, int N){
  __shared__ T s1[static_smem_size];
  extern __shared__ T input[];
  for (int i = threadIdx.x; i < static_smem_size; i += blockDim.x) {
    s1[i] = src[i % N];
  }
  __syncthreads();
  for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < N; i += blockDim.x * gridDim.x) {
    dst[i] = 2.0f * s1[i % static_smem_size];
  }
}

int main() {
  cudaFuncAttributes attrs;
  cudaError_t success = cudaFuncGetAttributes(&attrs, MyKernel<float, 98>);
  if (success != cudaSuccess) {
    printf("cudaFuncGetAttributes error: %s\n", cudaGetErrorName(success));
    return 1;
  }
  printf("attrs.maxDynamicSharedSizeBytes=%d\n", attrs.maxDynamicSharedSizeBytes);
  return 0;
}

The output is

attrs.maxDynamicSharedSizeBytes=48760

It makes sense: 48760 + sizeof(float) * 98 = 49152. But for my real kernel (sorry, I can’t post the original code here), which also uses 98 floats of static shared memory, the default value is 48724. What else could be responsible for the extra 36 bytes? Maybe __shfl_sync?

Without seeing the code, it is only speculation. I would guess that somewhere in the kernel, perhaps in a library, there is another static shared memory declaration.

Since you have a posted kernel and a real kernel with different behaviour, you can binary-search over the differences between them to find the reason.
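
Before binary-searching, it may be quicker to read back how much static shared memory the compiler actually reserved for the real kernel. A minimal sketch, where RealKernel is just a stand-in name for your actual kernel:

cudaFuncAttributes attrs;
cudaFuncGetAttributes(&attrs, RealKernel);  // RealKernel is a placeholder name
// sharedSizeBytes is the total static shared memory the compiler reserved,
// including any declarations pulled in from headers or device libraries.
printf("static shared: %zu bytes\n", attrs.sharedSizeBytes);
printf("default max dynamic: %d bytes\n", attrs.maxDynamicSharedSizeBytes);

If sharedSizeBytes comes back as 428 rather than 392, those extra 36 bytes are your answer. Compiling with nvcc --resource-usage (ptxas -v) also prints the static smem per kernel, which is another way to check.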