Templates for loop unrolling, avoiding function calls, and assertion failures at compile time

Alright, this is a weird question.

I have a kernel whose loop count I need to fix at compile time. A #pragma unroll does not work in my case because I need ~128 versions of this kernel, plus some host logic to determine which version gets called, but template loop unrolling does work.

There are many versions of the kernel, each looping a different number of times.
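For anyone unfamiliar with the approach, here is a minimal sketch of template loop unrolling plus host-side dispatch. It is plain C++ so it builds without a GPU toolchain; in CUDA the same templates would presumably wrap a device function and a kernel launch. The actual code is not shown in this thread, so the body of func() and the names Unroll, kernel, FillTable, launch and kMaxIters are all hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-iteration body (func() in the post).
inline float func(float acc, int i) { return acc + static_cast<float>(i); }

// Recursive template: Unroll<N>::run expands to N nested calls of func(),
// with indices 0 .. N-1, which the compiler can inline into straight-line code.
template <int N>
struct Unroll {
    static inline float run(float acc) {
        return func(Unroll<N - 1>::run(acc), N - 1);
    }
};

// Base case ends the recursion.
template <>
struct Unroll<0> {
    static inline float run(float acc) { return acc; }
};

// One instantiation per trip count (stands in for one kernel version each).
template <int N>
float kernel(float seed) { return Unroll<N>::run(seed); }

typedef float (*KernelFn)(float);
const int kMaxIters = 8;  // assumed small here; the post needs ~128

// Fills table[0..N] with pointers to kernel<0> .. kernel<N>.
template <int N>
struct FillTable {
    static void fill(KernelFn* table) {
        table[N] = &kernel<N>;
        FillTable<N - 1>::fill(table);
    }
};

template <>
struct FillTable<0> {
    static void fill(KernelFn* table) { table[0] = &kernel<0>; }
};

// Host logic: pick the instantiation that matches a runtime trip count.
float launch(int iters, float seed) {
    assert(iters >= 0 && iters <= kMaxIters);
    static KernelFn table[kMaxIters + 1];
    static bool ready = false;
    if (!ready) {
        FillTable<kMaxIters>::fill(table);
        ready = true;
    }
    return table[iters](seed);
}
```

Calling launch(4, 0.0f) runs the version with four unrolled iterations. Note that the template recursion depth in this form grows linearly with the trip count.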

When I compile the code for the sm_1x architectures it works perfectly fine. The code compiles and func() is inlined as many times as I need.

When compiling for sm_20…well…

### Assertion failure at line 2562 of ../../gccfe/wfe_dst.cxx:

### Compiler Error during Writing WHIRL file phase:

### WFE_Increment_Scope_Level_Count: DST stack exhausted

nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//gfec returned non-zero status 1

I’ve tried the __forceinline__ and inline attributes on the code, neither of which works (the code is still not inlined and the sm_20 PTX has function calls in it, which is very costly for me). As soon as I specify -Xptxas -abi=no, that assertion occurs and things blow up.

Using -Xptxas -abi=no without any of the inline attributes still does not eliminate the function calls.

What am I missing here?

Well, it seems it might have been an interesting problem in my code…but I’m not 100% convinced I have it working right. There are a few extra instructions in the PTX output that should not be there (some ld.f32’s).

Alright, I think I found that problem too. Even though I am forcing it to inline, NVCC decided to generate the non-inlined version of the function as well…even though the kernel never actually makes that call. When I look at all the PTX code generated, it never makes a function call now.

Hi,

I just got a similar problem: gfec crashed when compiling a kernel that unrolls a loop of 185 iterations.

The same code compiles fine with 183 iterations.

It looks like a problem with large template depth or with big kernels.

(The kernel PTX code contains around 2250 lines with 183 iterations.)

I use CUDA 3.2 with Visual Studio 2008. It fails with both --gpu-architecture sm_10 and --gpu-architecture sm_13.
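If template depth really is the trigger (an assumption; nothing in this thread confirms it), one thing that could be tried is splitting the unrolled range in halves so the recursion depth grows with log2(N) instead of N. The sketch below is plain C++ with a hypothetical func() body and hypothetical names (UnrollRange, unrolled); it only illustrates the idea and has not been tried against the compiler that crashes here.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-iteration body.
inline float func(float acc, int i) { return acc + static_cast<float>(i); }

// Applies func() for indices [Begin, Begin + Count) by splitting the range
// in two halves, so template nesting depth is about log2(Count).
template <int Begin, int Count>
struct UnrollRange {
    static inline float run(float acc) {
        acc = UnrollRange<Begin, Count / 2>::run(acc);
        return UnrollRange<Begin + Count / 2, Count - Count / 2>::run(acc);
    }
};

// A single iteration: call func() once with the index Begin.
template <int Begin>
struct UnrollRange<Begin, 1> {
    static inline float run(float acc) { return func(acc, Begin); }
};

// An empty range: nothing to do.
template <int Begin>
struct UnrollRange<Begin, 0> {
    static inline float run(float acc) { return acc; }
};

// Convenience wrapper: N iterations with indices 0 .. N-1, in order.
template <int N>
float unrolled(float seed) { return UnrollRange<0, N>::run(seed); }
```

With this layout unrolled<185>(seed) nests templates roughly 8 levels deep rather than 185, while the iterations still run in index order. It does not reduce the size of the generated kernel, so it would not help if the kernel size is the real cause.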

Matthieu
