__forceinline__, __noinline__ and "Too many resources"

RTX 3080 12GB, CUDA 12.1, sm 8.6, driver 531.79, Visual Studio 2022 17.5.5

I have two versions of the same function, with inlining and without:

functions.cuh:

__device__ __forceinline__ void _foo(...)
	{
	// full body, lots of stuff
	...
	}
	
__device__ __noinline__ void _foo_noinline(...);

functions.cu:

__device__ __noinline__ void _foo_noinline(...)
	{
	_foo(...);
	}

PTX says:

1>ptxas info    : Function properties for _Z17_foo_...
1>    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

I’m using it in function ‘RenderGeometryVolume’ (in other unit).
RenderGeometryVolume has maxDynamicSharedSizeBytes set to 65536 bytes (with maximum shared memory per block, opt-in, 101376).
Block is 1024 threads big.
When I use inline version of the foo in RenderGeometryVolume, PTX info looks like this (release):

1>ptxas info    : Compiling entry function '_Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params' for 'sm_86'
1>ptxas info    : Function properties for _Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params
1>    496 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 97 registers, 364 bytes cmem[0]

As my block is 1024 threads big, 97 > 64, obviously I get “Too many resources requested for launch”.

When I use “noinline” version, PTX info looks like this (release):

1>ptxas info    : Compiling entry function '_Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params' for 'sm_86'
1>ptxas info    : Function properties for _Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params
1>    368 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 47 registers, 364 bytes cmem[0]

47 < 64, nice. But I’m still getting “Too many resources requested for launch”!
Why?


So, as an experiment, I add to the ‘render.cu’ unit a function:

__device__ __noinline__ void _foo_forvolume(...)
	{
	_foo_noinline(...);
	}

PTX (release):

1>ptxas info    : Function properties for _Z18_foo...
1>    16 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
1>ptxas info    : Compiling entry function '_Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params' for 'sm_86'
1>ptxas info    : Function properties for _Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params
1>    304 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 36 registers, 364 bytes cmem[0]

Now it’s 36.
All I did is wrap ‘noinline’ version in ‘noinline’ function, but somehow number or registers used changed.
Still getting “Too many resources requested for launch”.


Just for the full picture, debug versions:

Now PTX shows noinline version:

1>ptxas info    : Function properties for _Z17_foo_noinline...
1>    3000 bytes stack frame, 176 bytes spill stores, 176 bytes spill loads

RenderGeometryVolume with inline version of foo:

1>ptxas info    : Compiling entry function '_Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params' for 'sm_86'
1>ptxas info    : Function properties for _Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params
1>    11312 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 170 registers, 364 bytes cmem[0]

RenderGeometryVolume with noinline version (same info when using local wrapper):

1>ptxas info    : Compiling entry function '_Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params' for 'sm_86'
1>ptxas info    : Function properties for _Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params
1>    560 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 42 registers, 364 bytes cmem[0]

To be clear: there is no problem using foo in relatively simple kernels (which uses small number of registers).
I suspect that in this case compiler ignores __noinline__ attribute after __inline__.
If I paste foo body into foo_noinline, then everything works as intended.
Another thing is, why the number of registers change when making a wrapper that shouldn’t change anything.