the result is 48724. It looks like the the default value for cudaFuncAttributeMaxDynamicSharedMemorySize for my kernel is not exactly 48 << 10 bytes. Why is that?
I suggest providing a short, complete, test case. If you are also using statically allocated shared memory, that will probably be at least part of the explanation.
FWIW 48964 is not larger than or equal to 48 << 10, so the first code snippet you have shown would be irrelevant for that case.
and with respect to the 2nd code snippet, if you don’t request anything higher (you haven’t, it seems to me) the value you are going to get is 48K (i.e. 49152) reduced by some reservation made by the CUDA runtime for its own use.
“Note that the maximum amount of shared memory per thread block is smaller than the maximum shared memory partition available per SM. The 1 KB of shared memory not made available to a thread block is reserved for system use.”
@rs277 I think this is irrelevant to my problem. It just explains why the maximum amount of shared memory per thread block (99KB on sm86) is 1KB smaller than the maximum shared memory partition available per SM (100KB on SM86) .
To avoid further speculation as to what is relevant to your problem, consider posting a minimal self-contained reproducer code that others can build and run.
#include <stdio.h>
#include <cuda.h>
template<typename T, int static_smem_size>
__global__ void MyKernel(T* dst, const T* src, int N){
__shared__ T s1[static_smem_size];
extern __shared__ T input[];
for (int i = threadIdx.x; i < static_smem_size; i += blockDim.x) {
s1[i] = src[i % N];
}
__syncthreads();
for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < N; i += blockDim.x * gridDim.x) {
dst[i] = 2.0f * s1[i % static_smem_size];
}
}
int main() {
cudaFuncAttributes attrs;
cudaError_t success = cudaFuncGetAttributes(&attrs, MyKernel<float, 98>);
if (success != cudaSuccess) {
printf("cudaFuncGetAttributes error: %s\n", cudaGetErrorName(success));
return 1;
}
printf("attrs.maxDynamicSharedSizeBytes=%d\n", attrs.maxDynamicSharedSizeBytes);
return 0;
}
The output is
attrs.maxDynamicSharedSizeBytes=48760
It makes sense: 48760 + sizeof(float) * 98 = 49152. But for my real kernel (sorry I can’t post the originial code here), which also uses 98 floats in static shared memory, the default value is 48724. What else can be responsible for the extra 36 bytes? Maybe __shfl_sync?
Without showing the code, it is only speculation. I would guess that somewhere in the kernel, maybe in a library, there is another static shared memory declaration.