Static allocation succeeds for more than 48KB of shared memory?

The Programming Guide says that anything “above 48 KB requires dynamic shared memory”, so I tested on an RTX 4090D with 99 KB of statically allocated shared memory. However, the following kernel runs without any failure or warning and prints the expected result. Am I missing something? It seems that static allocation supports more than 48 KB?

#include <cuda_runtime.h>
#include <stdio.h>
#include <iostream>

__global__ void func(float* output0, float* input0) {

  constexpr int SIZE = 25344;   // 4 B * 25344 = 101376 B = 99 KB

  __shared__ float data[SIZE];  // static allocation of 99 KB
  data[0] = 0.0f;
  if (threadIdx.x == 0) {
    for (int i = 1; i < SIZE; i++) {
      data[i] = data[i - 1] + 1.0f;
    }
  }
  __syncthreads();

  output0[threadIdx.x] = input0[threadIdx.x] + data[SIZE - 1 - threadIdx.x];
}




int main() {
  const int N = 512;
  const int block_size = 512;
  const int grid_size = 10;


  float *h_input = new float[N];
  float *h_output = new float[N];

  for (int i = 0; i < N; i++) {
    h_input[i] = 1;
    h_output[i] = 0;
  }

  float *d_input, *d_output;
  cudaMalloc(&d_input, N * sizeof(float));
  cudaMalloc(&d_output, N * sizeof(float));

  cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice);

  // launch without any opt-in (no cudaFuncSetAttribute, no dynamic shared memory)
  func<<<dim3(grid_size,1,1), dim3(block_size,1,1)>>>(d_output, d_input);
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    printf("Failed to launch kernel: %s\n", cudaGetErrorString(err));
  }

  cudaMemcpy(h_output, d_output, N * sizeof(float), cudaMemcpyDeviceToHost);
  
  cudaFree(d_input);
  cudaFree(d_output);

  for (int i = 0; i < 10; i++) {
    std::cout << i << ": " << h_output[i] << std::endl;
  }

  delete[] h_input;
  delete[] h_output;
  return 0;
}
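
For reference, a minimal device-property query (a sketch, not part of the original post) shows the two limits involved here: the default 48 KB per-block limit that ptxas normally enforces for static __shared__ data, and the larger opt-in limit, which the Programming Guide lists as 99 KB (101376 bytes) for compute capability 8.9:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
  // Default per-block shared memory limit (48 KB = 49152 bytes on most GPUs).
  printf("sharedMemPerBlock:          %zu bytes\n", prop.sharedMemPerBlock);
  // Larger per-block limit that is only reachable via the dynamic shared
  // memory opt-in (expected to be 101376 bytes = 99 KB on sm_89).
  printf("sharedMemPerBlockOptin:     %zu bytes\n", prop.sharedMemPerBlockOptin);
  // Total shared memory available per SM.
  printf("sharedMemPerMultiprocessor: %zu bytes\n", prop.sharedMemPerMultiprocessor);
  return 0;
}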

It cannot be compiled on godbolt:

Compiler Explorer

ptxas error   : Entry function '_Z4funcPfS_' uses too much shared data (0x18c00 bytes, 0xc000 max)

Compiler returned: 255

Interesting! I tried on godbolt and got the same error as you did.

I also tried with my local nvcc compiler and generated the PTX code successfully:

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-35059454
// Cuda compilation tools, release 12.6, V12.6.85
// Based on NVVM 7.0.1
//

.version 8.5
.target sm_89
.address_size 64

	// .globl	_Z4funcPfS_
// _ZZ4funcPfS_E4data has been demoted

.visible .entry _Z4funcPfS_(
	.param .u64 _Z4funcPfS__param_0,
	.param .u64 _Z4funcPfS__param_1
)
{
	.reg .pred 	%p<3>;
	.reg .f32 	%f<11>;
	.reg .b32 	%r<17>;
	.reg .b64 	%rd<8>;
	// demoted variable
	.shared .align 4 .b8 _ZZ4funcPfS_E4data[101376];

	ld.param.u64 	%rd1, [_Z4funcPfS__param_0];
	ld.param.u64 	%rd2, [_Z4funcPfS__param_1];
	mov.u32 	%r7, 0;
	st.shared.u32 	[_ZZ4funcPfS_E4data], %r7;
	mov.u32 	%r1, %tid.x;
	setp.ne.s32 	%p1, %r1, 0;
	@%p1 bra 	$L__BB0_4;

	mov.f32 	%f10, 0f00000000;
	mov.u32 	%r16, 8;
	mov.u32 	%r15, _ZZ4funcPfS_E4data;
	bra.uni 	$L__BB0_2;

$L__BB0_3:
	add.f32 	%f10, %f2, 0f3F800000;
	st.shared.f32 	[%r15+16], %f10;
	add.s32 	%r16, %r16, 16;
	add.s32 	%r15, %r15, 16;

$L__BB0_2:
	add.f32 	%f5, %f10, 0f3F800000;
	st.shared.f32 	[%r15+4], %f5;
	add.f32 	%f6, %f5, 0f3F800000;
	st.shared.f32 	[%r15+8], %f6;
	add.f32 	%f2, %f6, 0f3F800000;
	st.shared.f32 	[%r15+12], %f2;
	setp.eq.s32 	%p2, %r16, 101368;
	@%p2 bra 	$L__BB0_4;
	bra.uni 	$L__BB0_3;

$L__BB0_4:
	bar.sync 	0;
	cvta.to.global.u64 	%rd3, %rd2;
	mul.wide.u32 	%rd4, %r1, 4;
	add.s64 	%rd5, %rd3, %rd4;
	mov.u32 	%r10, 25343;
	sub.s32 	%r11, %r10, %r1;
	shl.b32 	%r12, %r11, 2;
	mov.u32 	%r13, _ZZ4funcPfS_E4data;
	add.s32 	%r14, %r13, %r12;
	ld.shared.f32 	%f7, [%r14];
	ld.global.f32 	%f8, [%rd5];
	add.f32 	%f9, %f8, %f7;
	cvta.to.global.u64 	%rd6, %rd1;
	add.s64 	%rd7, %rd6, %rd4;
	st.global.f32 	[%rd7], %f9;
	ret;

}

The error is reported by ptxas when it compiles the PTX code to SASS (machine code). ptxas is the compiler component that performs all the architecture-specific work during CUDA compilation.
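
One way to cross-check what ptxas actually recorded for the kernel is to query its attributes at run time. A minimal sketch, assuming it is added to the same .cu file as the func kernel from the first post (the helper name is a placeholder, not code from this thread):

#include <cuda_runtime.h>
#include <cstdio>

// Assumes __global__ void func(float*, float*) from the first post is defined
// in this translation unit.
void report_func_attributes() {
  cudaFuncAttributes attr{};
  cudaError_t err = cudaFuncGetAttributes(&attr, func);
  if (err != cudaSuccess) {
    printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
    return;
  }
  // sharedSizeBytes is the static shared memory per block recorded for this
  // kernel (101376 bytes would confirm the 99 KB static allocation).
  printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
  printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
}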

That’s true. I can compile it to SASS code locally. It’s quite long, and due to my company’s cybersecurity policy I cannot upload it as a file.

Can you name the version of ptxas you used and the command line with its parameters/options?

That’s OK. Your task is to present code and a compiler invocation that reproduces the claimed behavior. It does not have to be related to your production code.

With the code posted at the top of this thread I cannot reproduce the claimed behavior when I build locally or at Godbolt. Have you double-checked your build process to make sure it is not silently ignoring errors?

Trying different compilers on godbolt, the code compiles with NVCC 12.4.1 - 12.6.2.

I would not rely on this behavior (a compiler defect?) and would always use external (dynamic) shared memory for shared memory sizes > 48 KB, as sketched below.
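
For reference, here is a minimal sketch of that approach (the kernel and launcher names are illustrative placeholders, not code posted in this thread): an extern __shared__ array whose size is passed at launch, plus the cudaFuncSetAttribute opt-in that raises the per-kernel dynamic shared memory limit.

#include <cuda_runtime.h>

// Same computation as the kernel above, but with dynamic shared memory.
__global__ void func_dyn(float* output0, float* input0, int size) {
  extern __shared__ float data[];          // size is supplied at launch time
  data[0] = 0.0f;
  if (threadIdx.x == 0) {
    for (int i = 1; i < size; i++) {
      data[i] = data[i - 1] + 1.0f;
    }
  }
  __syncthreads();
  output0[threadIdx.x] = input0[threadIdx.x] + data[size - 1 - threadIdx.x];
}

cudaError_t launch_func_dyn(float* d_output, float* d_input) {
  constexpr int SIZE = 25344;
  const size_t smem_bytes = SIZE * sizeof(float);   // 101376 B = 99 KB
  // Opt in to more than 48 KB of dynamic shared memory for this kernel.
  cudaError_t err = cudaFuncSetAttribute(
      func_dyn, cudaFuncAttributeMaxDynamicSharedMemorySize, (int)smem_bytes);
  if (err != cudaSuccess) return err;
  // The third launch parameter is the dynamic shared memory size in bytes.
  func_dyn<<<dim3(10, 1, 1), dim3(512, 1, 1), smem_bytes>>>(d_output, d_input, SIZE);
  return cudaGetLastError();
}

Without the cudaFuncSetAttribute call, a launch that requests more than 48 KB of dynamic shared memory is expected to fail at run time.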


I compiled with the following script.

NVCC Version:

  • Cuda compilation tools, release 12.6, V12.6.85
  • Build cuda_12.6.r12.6/compiler.35059454_0
#!/bin/bash
set -e

# --------------------------
# init args & basic check
# --------------------------
if [ $# -ne 1 ]; then
    echo "Usage: $0 file.cu <target arch(e.g. sm_89)>"
    echo "Example: $0 kernel.cu"
    exit 1
fi

INPUT_CUDA="$1"       # input cu file
TARGET_ARCH="sm_89"   # gpu arch for 4090D
OUTPUT_PTX="output.ptx"   
OUTPUT_CUBIN="output.cubin" 
OUTPUT_SASS="output.sass"   

# find input file
if [ ! -f "$INPUT_CUDA" ]; then
    echo "FATAL: cu file: $INPUT_CUDA not exists"
    exit 1
fi

# check nvcc
if ! command -v nvcc &> /dev/null; then
    echo "FATAL: not found nvcc"
    exit 1
fi

# check cuobjdump
if ! command -v cuobjdump &> /dev/null; then
    echo "FATAL: not found cuobjdump"
    exit 1
fi

# --------------------------
# Step1: Cu file -> cubin
# --------------------------
echo "Generating cubin(GPU Arch: $TARGET_ARCH)..."
nvcc -arch="$TARGET_ARCH" -cubin "$INPUT_CUDA" -O2 -o "$OUTPUT_CUBIN"

if [ $? -ne 0 ]; then
    echo "FATAL: CUBIN Compile Failed!"
    exit 1
fi

# --------------------------
# Step2: cu file -> ptx
# --------------------------
echo "Generating ptx from $INPUT_CUDA ..."
nvcc --ptx -arch="$TARGET_ARCH" "$INPUT_CUDA" -O2 -o "$OUTPUT_PTX"

if [ $? -ne 0 ]; then
    echo "FATAL: PTX generation failed!"
    exit 1
fi

# --------------------------
# Step3:disasm CUBIN -> sass
# --------------------------
echo "Disassembling sass"
cuobjdump -sass "$OUTPUT_CUBIN" > "$OUTPUT_SASS"

if [ $? -ne 0 ]; then
    echo "FATAL: SASS Failed!"
    exit 1
fi

# --------------------------
# Job Done!
# --------------------------
echo "All Finished!"
echo "Summary:"
echo "  PTX: $OUTPUT_PTX"
echo "  SASS: $OUTPUT_SASS"

The compiler version you are using falls squarely into the range of compiler versions established by @striker159 above as not flagging a static shared memory size > 48 KB as an error. This appears to be a bug in these compiler versions.

Consider upgrading to a newer compiler version.