Alright, this is a weird question.
I have a kernel whose loop count needs to be fixed at compile time. A #pragma unroll does not work in my case because I need ~128 versions of this kernel, plus some host logic to determine which one gets called, but template loop unrolling does work.
There are many versions of the kernel, and each one loops a different number of times.
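The pattern described above looks roughly like the following sketch. It is plain host-side C++ so that it stands alone; it is not the actual code from this question. The names `Unroll`, `kernel_body`, `TableFiller`, `launch` and `kMaxIters`, and the body of `func()`, are all made up for illustration. In a real CUDA build, `func()` and `Unroll::run` would presumably be `__device__` functions, `kernel_body` would be a `__global__` kernel, and the table lookup would end in a kernel launch rather than a direct call.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the func() mentioned in the post.
inline int func(int acc, int i) { return acc + i; }

// Recursive template: Unroll<N>::run expands to N consecutive calls of func(),
// for i = 0 .. N-1, with no runtime loop.
template <int N>
struct Unroll {
    static inline int run(int acc) {
        acc = Unroll<N - 1>::run(acc);  // iterations 0 .. N-2
        return func(acc, N - 1);        // iteration N-1
    }
};

// Base case: zero iterations leaves the accumulator untouched.
template <>
struct Unroll<0> {
    static inline int run(int acc) { return acc; }
};

// One "kernel" per compile-time iteration count.
template <int N>
int kernel_body(int seed) { return Unroll<N>::run(seed); }

typedef int (*KernelFn)(int);

// Fills table[0 .. N] with the matching instantiations, again by recursion.
template <int N>
struct TableFiller {
    static void fill(KernelFn* table) {
        TableFiller<N - 1>::fill(table);
        table[N] = &kernel_body<N>;
    }
};

template <>
struct TableFiller<0> {
    static void fill(KernelFn* table) { table[0] = &kernel_body<0>; }
};

// Assumed upper bound, matching the ~128 versions mentioned above.
const int kMaxIters = 128;

// Host logic: pick the instantiation that matches the runtime iteration count.
int launch(int iters, int seed) {
    static KernelFn table[kMaxIters + 1];
    static bool ready = false;
    if (!ready) {
        TableFiller<kMaxIters>::fill(table);
        ready = true;
    }
    if (iters < 0 || iters > kMaxIters) {
        throw std::out_of_range("iteration count not instantiated");
    }
    return table[iters](seed);
}
```

For example, `launch(4, 10)` selects the four-iteration instantiation and returns 10 + 0 + 1 + 2 + 3 = 16. Whether the compiler actually inlines every `func()` call in each instantiation is exactly the part that differs between sm_1x and sm_20 here.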
When I compile the code for the sm_1x architectures it works perfectly fine. The code compiles and inlines func() as many times as I need.
When compiling for sm_20…well…
### Assertion failure at line 2562 of ../../gccfe/wfe_dst.cxx:
### Compiler Error during Writing WHIRL file phase:
### WFE_Increment_Scope_Level_Count: DST stack exhausted
nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//gfec returned non-zero status 1
I’ve tried the forceinline and inline attributes on the code, neither of which works: the code is still not inlined, and the sm_20 PTX contains function calls, which is very costly for me. As soon as I specify -Xptxas -abi=no, that assertion occurs and things blow up.
Using -Xptxas -abi=no without any of the inline attributes still does not eliminate the function calls.
What am I missing here?