templates/recursion problem

I’m having a problem using templates with recursion. It works perfectly fine in emulation mode, but when I switch to release it produces some weird results. The idea is to unroll the innermost loop like this:

texture<unsigned char, 2, cudaReadModeElementType> tex;

template <int x>
__device__ float comp
(
    unsigned int startx,
    unsigned int pos_y,
    unsigned char bl,
    unsigned char gr,
    unsigned char re,
    int rozn_b,
    int rozn_g,
    int rozn_r
)
{
    rozn_b=bl-tex2D(tex,x*3+startx  ,pos_y);
    rozn_g=gr-tex2D(tex,x*3+startx+1,pos_y);
    rozn_r=re-tex2D(tex,x*3+startx+2,pos_y);
    return (sqrtf(
        __mul24(rozn_b,rozn_b)+
        __mul24(rozn_g,rozn_g)+
        __mul24(rozn_r,rozn_r))+
        comp<(x-1)(startx,pos_y,bl,gr,re,rozn_b,rozn_g,rozn_r));
}

template <>
__device__ float comp<0>
(
    unsigned int startx,
    unsigned int pos_y,
    unsigned char bl,
    unsigned char gr,
    unsigned char re,
    int rozn_b,
    int rozn_g,
    int rozn_r
)
{
    rozn_b=bl-tex2D(tex,startx  ,pos_y);
    rozn_g=gr-tex2D(tex,startx+1,pos_y);
    rozn_r=re-tex2D(tex,startx+2,pos_y);
    return (sqrtf(
        __mul24(rozn_b,rozn_b)+
        __mul24(rozn_g,rozn_g)+
        __mul24(rozn_r,rozn_r)));
}

I’m calling it in this way:

sum+=comp<(KERNEL_RADIUS*2)>(startx,a,bl,gr,re,rozn_b,rozn_g,rozn_r);

Using a simple “for” loop in an analogous way gives correct results. Does anyone have any idea what the problem with this code could be?

I use an Intel Core 2 Quad 4x2.4 GHz, 2 GB RAM, a 8600 GTS, Windows XP, CUDA SDK & Toolkit v1.0.

Jakub

Recursion is not supported

But isn’t this similar to the following code from the SDK?

////////////////////////////////////////////////////////////////////////////////
// Loop unrolling templates, needed for best performance
////////////////////////////////////////////////////////////////////////////////
template<int i> __device__ float convolutionRow(float *data){
    return
        data[KERNEL_RADIUS - i] * d_Kernel[i]
        + convolutionRow<i - 1>(data);
}

template<> __device__ float convolutionRow<-1>(float *data){
    return 0;
}

template<int i> __device__ float convolutionColumn(float *data){
    return
        data[(KERNEL_RADIUS - i) * COLUMN_TILE_W] * d_Kernel[i]
        + convolutionColumn<i - 1>(data);
}

template<> __device__ float convolutionColumn<-1>(float *data){
    return 0;
}

In the SDK it works, so what’s the difference? It’s recursion with templates as well…

From page 18 of the programming guide:

I’m not very familiar with this construction in CUDA (I use templates only in Java and C++), but my guess is: the compiler transforms a recursive call to a for-loop. Hence the name: “Loop unrolling templates.” Perhaps in your case, the code is too complex for a rewrite?

In that case, if the recursive call were transformed into a “for” loop, would it make any sense to unroll loops at all? In every case I have seen, loop unrolling leads to a significant performance improvement. But you may be right that the function is too complex. Is there any chance that installing CUDA SDK/Toolkit 1.1 will solve the problem?

nvcc in CUDA 1.1 can automatically unroll for loops now. See Section 4.2.5.2 in the CUDA 1.1 Programming Guide.

Well, this could really help. I’ll need to install v1.1 then.

Thank you both for answering.

Jakub

it’s true that you cannot do recursion. the trick with templates is that you can fake recursion (surely not in all cases). the following piece of code is not recursive at the “function level”; it’s just a few functions (instantiated at compile time) calling each other ;-)

template<int N> __device__ inline int f();

template<> __device__ inline int f<0>()
{
    return 0;
}

template<> __device__ inline int f<1>()
{
    return 1;
}

template<int N> __device__ inline int f()
{
    return f<N-1>() + f<N-2>();
}

__global__ ...
...
    fib = f<7>();

cheers : )

sorry for posting twice.

this is somewhat unrelated to my other post and probably not very useful at all… nevertheless:

i did not find the closing angle bracket corresponding to the one i marked in the code above. this is probably just me being blind or some sort of copy&paste error, but if not … ; )

I must have deleted the closing angle bracket accidentally. It should be:

(...)+
comp<(x-1)>(startx,pos_y,bl,gr,re,rozn_b,rozn_g,rozn_r));

Well… I know that template recursion is really a pack of separate functions calling each other; the problem is why the release version doesn’t behave like the emulation version. I have installed CUDA Toolkit v1.1 and used “#pragma unroll” in the previous version of this code, the one that uses “for” loops. The average times with “#pragma unroll” (which is supposed to always unroll loops) and “#pragma unroll 1” (no unrolling at all) are exactly the same, so apparently the compiler couldn’t unroll it (I assume the speedup from unrolling would be != 0). As written before, the problem could lie in the complexity of this function, but I haven’t found any restrictions about that in the Programming Guide v1.1. Could this be a compiler bug?