RTX 3080 12GB, CUDA 12.1, sm 8.6, driver 531.79, Visual Studio 2022 17.5.5
I have two versions of the same function, with inlining and without:
functions.cuh:
__device__ __forceinline__ void _foo(...)
{
// full body, lots of stuff
...
}
__device__ __noinline__ void _foo_noinline(...);
functions.cu:
__device__ __noinline__ void _foo_noinline(...)
{
_foo(...);
}
PTX says:
1>ptxas info : Function properties for _Z17_foo_...
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
I’m using it in function ‘RenderGeometryVolume’ (in other unit).
RenderGeometryVolume has maxDynamicSharedSizeBytes set to 65536 bytes (with maximum shared memory per block, opt-in, 101376).
Block is 1024 threads big.
When I use inline version of the foo in RenderGeometryVolume, PTX info looks like this (release):
1>ptxas info : Compiling entry function '_Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params' for 'sm_86'
1>ptxas info : Function properties for _Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params
1> 496 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info : Used 97 registers, 364 bytes cmem[0]
As my block is 1024 threads big, 97 > 64, obviously I get “Too many resources requested for launch”.
When I use “noinline” version, PTX info looks like this (release):
1>ptxas info : Compiling entry function '_Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params' for 'sm_86'
1>ptxas info : Function properties for _Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params
1> 368 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info : Used 47 registers, 364 bytes cmem[0]
47 < 64, nice. But I’m still getting “Too many resources requested for launch”!
Why?
So, as an experiment, I add to the ‘render.cu’ unit a function:
__device__ __noinline__ void _foo_forvolume(...)
{
_foo_noinline(...);
}
PTX (release):
1>ptxas info : Function properties for _Z18_foo...
1> 16 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
1>ptxas info : Compiling entry function '_Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params' for 'sm_86'
1>ptxas info : Function properties for _Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params
1> 304 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info : Used 36 registers, 364 bytes cmem[0]
Now it’s 36.
All I did is wrap ‘noinline’ version in ‘noinline’ function, but somehow number or registers used changed.
Still getting “Too many resources requested for launch”.
Just for the full picture, debug versions:
Now PTX shows noinline version:
1>ptxas info : Function properties for _Z17_foo_noinline...
1> 3000 bytes stack frame, 176 bytes spill stores, 176 bytes spill loads
RenderGeometryVolume with inline version of foo:
1>ptxas info : Compiling entry function '_Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params' for 'sm_86'
1>ptxas info : Function properties for _Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params
1> 11312 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info : Used 170 registers, 364 bytes cmem[0]
RenderGeometryVolume with noinline version (same info when using local wrapper):
1>ptxas info : Compiling entry function '_Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params' for 'sm_86'
1>ptxas info : Function properties for _Z27Kernel_RenderGeometryVolume35SKernel_RenderGeometryVolume_Params
1> 560 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info : Used 42 registers, 364 bytes cmem[0]