call stack/ stack frame

hello,

i am seating far fewer thread blocks concurrently than expected; my kernel’s ptxas information references the stack frame frequently

first of all, where does the stack frame/ call stack reside? the programming guide only mentions that its size can be queried/ set with cudaDeviceGetLimit() and cudaDeviceSetLimit(), not what its default size is, nor where it resides

It resides in local memory space per thread.

i suppose its default size is then the configured L1 cache size…?

“cudaLimitStackSize controls the stack size in bytes of each GPU thread.”

should i interpret the number of stack frame bytes reported by ptxas info as being per thread?

e.g. one of the device functions called by my kernel:

ptxas info : Function properties for _Z21fwd_nf_upd_static_fwdjjdPdS_
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

Yes, the reporting is per thread. I don’t know if the default size is documented. I suggest just querying it if you need it. I’m not sure why it would be connected to L1 cache size.

“I suggest just querying it if you need it”

I shall do that, thanks

" I’m not sure why it would be connected to L1 cache size."

you noted that it resides in local memory, and to my understanding that implicates registers, the L1 cache (and the L1 cache size, as a buffer before global memory), and then finally global memory; which really just leaves the L1 cache and global memory, as i doubt whether registers are used to store the stack…

(in my case) the default value is 1024
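for anyone who lands here later, a minimal host-side sketch of the query/ set calls mentioned above (the 1024-byte default is just what the query returned on my device; it is not guaranteed across GPUs or driver versions):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // query the current per-thread stack size limit
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("default per-thread stack size: %zu bytes\n", stackSize);

    // optionally raise it, e.g. for deeply nested non-inlined calls;
    // note this reserves local memory for every resident thread
    cudaDeviceSetLimit(cudaLimitStackSize, 2048);

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("per-thread stack size now: %zu bytes\n", stackSize);
    return 0;
}
```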

i notice from my ptxas information that device functions that require/ have a reported stack frame value are also likely to have reported spill stores and spill loads…

for example:

ptxas info : Function properties for _Z17fwd_nf_set_rangesjjPb
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z29fwd_nf_resolve_failed_retractRbS_S_RjS0_S0_jjjdRdPcPjS3_PdS4_S4_
120 bytes stack frame, 116 bytes spill stores, 116 bytes spill loads

is it safe to assume that the spills are related to the stack frame?
yet, in the case of the latter function - the function itself is called by a single thread only, to set the kernel execution course; hence, i really doubt whether the stack itself spills, and the required stack frame bytes are lower than the default value

Are you sure the reason for your lower occupancy is the size of each thread’s stack? If anything, it might be because of the kernel design itself.

Just as with other ABIs, CUDA’s stack frame serves multiple purposes. One of its uses is to provide temporary storage for spilled registers. As the amount of spill data in fwd_nf_resolve_failed_retract() is almost exactly equal to the size of the stack frame, it is reasonable to assume that the only use of the stack frame in this case is to provide storage for spilled registers. To be absolutely sure, you would have to look at the disassembled machine code (SASS).

njuffa - to my rescue, as always

i do not have the code open in front of me, and plan to further look at it later on, but…

the noted function is called by a single thread to set direction as mentioned

this makes it very difficult for me to subscribe to any theory of spilling in the pure sense - a single thread in use, and the function itself hardly contains that many operands to even remotely suggest register spilling; otherwise i am really missing something here

on the other hand - because it is a control function, it sets shared memory variables; hence much of what is on the stack actually consists of references, come to think of it…
that might perhaps explain the correlation between stack frame size, spill stores and spill loads
i do not know how ptxas would reference and report on references…

The spill statistics produced by PTXAS should be completely accurate. After all the compiler knows exactly when it spills and reloads registers. As to why there are spills, it is impossible to locate the source of the register pressure without seeing the code and the compiler switches with which the code was compiled. The code may be compiled with a very low register use target. It may call mathematical functions that require a lot of temporary registers internally. Those are just two of the possibilities.

If you want to analyze this in detail, you need to look at SASS, not just PTX. PTX is merely an intermediate format that uses virtual registers. The PTX code is compiled into SASS by PTXAS, which performs many code transformations and is also responsible for allocating physical registers. It is at this stage that register spilling takes place.

Thread configuration is a runtime issue unknown to the compiler. The compile-time view of the code by the compiler is (with a few exceptions) that of a single-thread program.

i understood most of what you said, except for this one line:

“The code may be compiled with a very low register use target.”

could you elaborate, please

I think he meant the use of the

--maxrregcount <amount>

parameter you can pass to nvcc.

You have to scroll down a little bit to find it.

When I mentioned a “low register use target” I was referring to either the use of the -maxrregcount compiler flag or the __launch_bounds__() function attribute. My main point is that without seeing the entire code and the compilation flags, trying to pin-point a reason for the register spills is just going to result in a lot of speculation.
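to illustrate the two register-use targets just mentioned, a sketch (the kernel name and body are hypothetical, just for demonstration):

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// tells ptxas to cap per-thread register usage so that at least the
// requested number of blocks can be resident per SM; if the cap is
// too aggressive, surplus values get spilled to local memory.
__global__ void __launch_bounds__(32, 8)
my_kernel(const double *in, double *out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0;
}

// The equivalent file-wide control is the compiler flag, e.g.:
//   nvcc -maxrregcount=32 kernel.cu
// which applies the same register cap to every kernel in the file.
```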

“My main point is that without seeing the entire code and the compilation flags, trying to pin-point a reason for the register spills is just going to result in a lot of speculation.”

noted and understood

however, i doubt whether the context of a single spilling function would be sufficient to understand the cause of spilling; some additional information:

something i have not mentioned is that i do, at present, manage to seat 8 blocks per SM concurrently, at 32 threads per block
i thought that i would be able to seat more blocks concurrently; i now suppose the spilling currently occurring explains why the compiler feels it is sub-optimal to allow more blocks to run concurrently

i have little reason to synchronize across blocks (as opposed to synchronizing within blocks), so i don’t; i guess that might equally complicate and impact register allocation, spilling, and the tracking of spilling

the ptxas information for the kernel itself (as opposed to ptxas information for device functions called by the kernel):

ptxas info : Function properties for _Z8fwd_krnlbbjjjjjjjjjjddddPbS_S_S_S_S_PcS0_PjS1_S1_S1_S1_S1_S1_S1_S1_PdS2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_
208 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 127 registers, 197 bytes smem, 688 bytes cmem[0]

if you look at the stack frame of the kernel itself, and suppose that it is stored/ pushed in registers, then seemingly the kernel itself is using up (close to) all of the SM’s registers merely by calling device functions, when considering an occupancy of 8 blocks per SM
i think i am passing far too many function parameters via the stack - if i reduce this, spilling should reduce or disappear; and this is actually something i can test
i shall look into passing function parameters via constant space instead, or as pointers to pointers via shared memory, by amalgamating parameters into arrays

The statistics for fwd_krnl() indicate that stack frame use is entirely for other purposes than providing storage for register spills. This is different from fwd_nf_resolve_failed_retract(), where the statistics indicate that most, or all, of the stack frame usage is due to spilling. So these are very different cases.

One would not readily expect negative performance impact from the use of 208 bytes of stack frame in fwd_krnl(). Even the moderate amount of spilling observed in fwd_nf_resolve_failed_retract() is not necessarily indicative of a performance problem, as there is a balance between occupancy and register pressure (and thus possibly spilling) for any given kernel, and it may well be optimal for that kernel at that level of spilling. Only more detailed analysis can show one way or the other.

Registers are spilled to (thread-)local memory when the data objects allocated to registers exceed the number of available physical registers. In other words, “register pressure” is high.

The number of available physical registers is a function of (a) the GPU architecture, which imposes a strict upper limit on the number of registers available per thread, and (b) programmer control via the -maxrregcount compiler flag and the __launch_bounds__() function attribute.

The number of data objects allocated to registers is primarily a function of scalar data variables occurring in the code, although in some instances small arrays may be allocated in registers. This includes variables in the source code, variables used by device functions incorporated into the code such as inlined device functions provided by the programmer, or standard C math functions and operations. In addition, it includes temporary variables created by the compiler during code optimizations, for example common sub-expressions extracted during CSE, or induction variables created by strength reduction in loops.

Since various compiler optimizations take place before there is any good estimate of register usage available (since register allocation doesn’t occur until the latter stages of PTXAS compilation) it is possible that some compiler code transformations lead to undesirable increases in register pressure. Compiler tuning over the past three to four years with the goal of tweaking relevant heuristics has significantly reduced such instances.

well, quite a mouthful, but noted nonetheless

i initially went after the stack frame as it was the first loose link for me, in explaining why the compiler refuses to further increase occupancy - seat more blocks concurrently

and i do think that the stack frame explains this - the relative size of the stack frame suggests relatively high register usage, occasionally leading to spilling; because of the spilling, the compiler perceives additional blocks as sub-optimal, as it might start to imply spilling to global memory instead of just the L1 cache

so, if i work at reducing the stack frame, by reducing function parameters passed via the stack, i should be able to reduce register pressure, and be able to seat more blocks concurrently
seating more blocks in turn is important to me, as most if not all of the kernel’s functions contain global memory reads, which should generally slow down execution; i am not convinced that 8 blocks are enough to negate this

from a pure efficiency perspective, is it indeed efficient to actually pass that many parameters via the stack, particularly in the case of parallel platforms/ implementations…?

if a (device) function requires, for argument’s sake, 10 parameters to complete, and cannot be decreased in size any further, would it be more efficient to pass the parameters via the stack, or to seek other means, like rolling the parameters up and passing them as a pointer to the parameters in shared memory or constant memory

this is really the emerging issue when you increase the device’s autonomy - you start using what would effectively have been kernels, as device functions, with the implied effect on function parameters

Clearly the answer is no. This has always been an aspect of parallel programming. Overheads get replicated for every thread. If you call a (non-inlined) function with two float parameters in a single threaded program you push 8 bytes through the memory hierarchy. If you call the same function simultaneously from 1024 threads then you push 8KB through the memory hierarchy.

Factoring out overheads such that you pay them once rather than once-per-thread is often a good idea.

noted; thanks

are there any established methods of “factoring out overheads” that i should essentially be aware of…?

right now, the only way i can think of to reduce the function parameter count - hypothetically at 10 - down to 1, is to roll the 10 function parameters up into an array, and to pass a pointer to the array… be it via shared or constant memory
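for the record, one common way to pay the overhead once rather than once-per-thread is to roll the parameters into a struct placed in constant memory by the host before the launch - a sketch under assumed names (FwdParams, fwd_device_fn and the member names are all hypothetical):

```cuda
#include <cuda_runtime.h>

// Hypothetical parameter block: 10 scalars/pointers rolled into one struct.
struct FwdParams {
    unsigned int n0, n1;
    double c0, c1, c2;
    double *a, *b, *c, *d, *e;
};

// One copy in constant memory, broadcast to all threads, instead of
// 10 parameters pushed per call per thread.
__constant__ FwdParams g_params;

__device__ void fwd_device_fn()
{
    // every thread reads the same cached constant-memory copy;
    // nothing is passed on the per-thread stack
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < g_params.n0)
        g_params.a[i] = g_params.c0 * g_params.b[i];
}

// Host side: fill the struct once, before launching the kernel.
void setup(const FwdParams &host_params)
{
    cudaMemcpyToSymbol(g_params, &host_params, sizeof(FwdParams));
}
```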