I have a kernel whose loop count I need to be able to set at compile time. A #pragma unroll does not work in my case because I need ~128 versions of this kernel, plus some host logic to determine which one gets called, but template loop unrolling does work.
There are many versions of the kernel, and each one loops a different number of times.
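For anyone unfamiliar with the approach, here is a rough sketch of what I mean by template loop unrolling plus host-side selection. It is written as plain C++ so it can be compiled anywhere, and it is not my actual code: `func`, `Unroll`, `kernel`, `FillTable`, `launch` and `kMaxIters` are all made-up names for illustration, and the body of `func` is a placeholder. In real CUDA code `func` would be a `__device__` function, `kernel` a `__global__` template, and the table entry would be launched with `<<<...>>>` instead of being called directly.

```cpp
// Hypothetical stand-in for the per-iteration work (func() in the post).
inline void func(float* acc, int i) { *acc += static_cast<float>(i); }

// Recursive template: Unroll<N>::run expands to N back-to-back calls of
// func(), with indices 0 .. N-1, so no runtime loop remains.
template <int N>
struct Unroll {
    static inline void run(float* acc) {
        Unroll<N - 1>::run(acc);
        func(acc, N - 1);
    }
};
template <>
struct Unroll<0> {
    static inline void run(float*) {}  // recursion ends here
};

// One "kernel" per trip count (assumed to be a __global__ template in CUDA).
template <int N>
void kernel(float* out) {
    float acc = 0.0f;
    Unroll<N>::run(&acc);
    *out = acc;
}

typedef void (*KernelFn)(float*);
const int kMaxIters = 128;  // assumed upper bound on the trip count

// Fills table[0..N] with pointers to kernel<0> .. kernel<N> at compile time,
// which avoids writing a 128-case switch by hand.
template <int N>
struct FillTable {
    static void fill(KernelFn* table) {
        table[N] = &kernel<N>;
        FillTable<N - 1>::fill(table);
    }
};
template <>
struct FillTable<0> {
    static void fill(KernelFn* table) { table[0] = &kernel<0>; }
};

// Host logic: pick the instantiation that matches the runtime trip count.
// Returns false if no instantiation exists for that count.
bool launch(int iters, float* out) {
    static KernelFn table[kMaxIters + 1];
    static bool ready = false;
    if (!ready) {
        FillTable<kMaxIters>::fill(table);
        ready = true;
    }
    if (iters < 0 || iters > kMaxIters) return false;
    table[iters](out);
    return true;
}
```

With this placeholder `func`, `launch(4, &out)` runs `kernel<4>` and accumulates 0 + 1 + 2 + 3. The point of the table is that every trip count gets its own fully unrolled instantiation while the choice between them still happens at run time on the host.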
When I compile the code for the sm_1x architectures it works perfectly fine: the code compiles and func() is inlined as many times as I need.
When compiling for sm_20…well…
### Assertion failure at line 2562 of ../../gccfe/wfe_dst.cxx:
### Compiler Error during Writing WHIRL file phase:
### WFE_Increment_Scope_Level_Count: DST stack exhausted
nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//gfec returned non-zero status 1
I’ve tried the `__forceinline__` and inline attributes on the code, neither of which works: the code is still not inlined, and the sm_20 PTX has function calls in it, which is very costly for me. As soon as I specify -Xptxas -abi=no, the assertion above occurs and things blow up.
Using -Xptxas -abi=no without any of the inline attributes still does not eliminate the function calls.
Well, it seems the cause might have been an interesting problem in my own code…but I’m not 100% convinced I have it working right. There are a few extra instructions in the PTX output that should not be there (some ld.f32’s).
Alright, I think I found that problem too. Even though I am forcing the function to be inlined, NVCC decided to generate a non-inlined version of it as well…even though the kernel never actually makes that call. Looking through all of the generated PTX, there are no function calls in it now.