ptxas -v output : kernel calling device functions


I have a __global__ kernel called funcMAIN, and inside it I call a __device__ function calcX.

My questions are :

  1. As far as I can see, funcMAIN does not show any spilling into local memory, but calcX shows a spill store and a spill load of 8 bytes. If calcX spills, why does that not show up in the ptxas output for funcMAIN? The function is declared inline; do the 8 bytes appear because the function could not be inlined?

  2. Is the register count shown for funcMAIN the total register count used by the kernel, including the registers used by functions called inside funcMAIN?

ptxas : info : Compiling entry function funcMAIN for 'sm_30'
ptxas : info : Function properties for funcMAIN 80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 24 registers, 736 bytes cumulative stack size, 412 bytes cmem[0]

ptxas : info : Function properties for calcX 616 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads

It is hard to tell from this information alone, but to me this looks like calcX was not inlined. You can check by dumping the machine code with cuobjdump --dump-sass [executable-name]. You state that calcX was “set to be inlined” but do not give details about the actual code. Try the __forceinline__ attribute on the function to force inlining. Please be aware that, since this seems to be a function with a hefty footprint, forced inlining might lower rather than increase performance.

In general, 8 bytes of spill data (equivalent to two 32-bit registers) is likely to have no impact on performance as spills are cached in the L1 cache, and the L1 is big enough to hold 8 bytes per thread under most circumstances.
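
The forced-inlining suggestion above can be sketched as follows. This is a minimal illustration, not the poster's actual code: the names funcMAIN and calcX come from the question, but the function body, the MODE template parameter, and the run_sketch helper are hypothetical stand-ins (the real calcX reportedly performs a 2D texture fetch).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for calcX: a template with three code paths.
// Simple arithmetic replaces the texture fetch to keep the sketch self-contained.
template <int MODE>
__device__ __forceinline__ float calcX(float v)
{
    if (MODE == 0) return v + 1.0f;
    if (MODE == 1) return v * 2.0f;
    return v * v;
}

// The kernel calls calcX in three places, one per code path.
__global__ void funcMAIN(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = calcX<0>(in[i]) + calcX<1>(in[i]) + calcX<2>(in[i]);
}

// Hypothetical helper: runs the kernel and checks it against a host reference.
// Returns 0 on success, 1 on a CUDA error, 2 on a result mismatch.
int run_sketch()
{
    const int n = 256;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = NULL, *d_out = NULL;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return 1;
    }

    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    funcMAIN<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
    // The blocking copy also surfaces errors from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i) {
        float v = h_in[i];
        if (h_out[i] != (v + 1.0f) + (v * 2.0f) + (v * v)) return 2;
    }
    return 0;
}

int main()
{
    int rc = run_sketch();
    printf("%s\n", rc == 0 ? "results match" : "mismatch or CUDA error");
    return rc;
}
```

Compiling this with nvcc -Xptxas -v should, if the inlining takes effect, produce a “Function properties” line for funcMAIN only, with no separate entry for calcX. Its stack and register needs are then folded into the kernel's own numbers.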

Hi njuffa,

I used __forceinline__ as you suggested, and after that the function no longer shows up in the ptxas info, which should suggest that it is now inlined into the main function.

What did you mean by “hefty footprint”? The function itself just does a texture fetch from a 2D texture, but it is a template with 3 different code paths, and it is called in 3 places in the code.

Thanks for your help!

By “hefty footprint” I meant a function that contains a lot of code and local data. I based this on the stack frame size of more than 600 bytes. Without seeing the source code, it is not clear to me why a function that just performs some 2D texture fetches would need that much stack space. But I guess this is secondary now that the function gets inlined as you desired.