templates/recursion problem

I’m having a problem using templates with recursion. It works perfectly fine in emulation mode, but when I switch to release it produces some weird results. The idea is to unroll the innermost loop like this:

texture<unsigned char, 2, cudaReadModeElementType> tex;

template <int x>
__device__ float comp
(
    unsigned int startx,
    unsigned int pos_y,
    unsigned char bl,
    unsigned char gr,
    unsigned char re,
    int rozn_b,
    int rozn_g,
    int rozn_r
)
{
    rozn_b=bl-tex2D(tex,x*3+startx  ,pos_y);
    rozn_g=gr-tex2D(tex,x*3+startx+1,pos_y);
    rozn_r=re-tex2D(tex,x*3+startx+2,pos_y);
    return (sqrtf(
        __mul24(rozn_b,rozn_b)+
        __mul24(rozn_g,rozn_g)+
        __mul24(rozn_r,rozn_r))+
        comp<(x-1)(startx,pos_y,bl,gr,re,rozn_b,rozn_g,rozn_r));
}

template <>
__device__ float comp<0>
(
    unsigned int startx,
    unsigned int pos_y,
    unsigned char bl,
    unsigned char gr,
    unsigned char re,
    int rozn_b,
    int rozn_g,
    int rozn_r
)
{
    rozn_b=bl-tex2D(tex,startx  ,pos_y);
    rozn_g=gr-tex2D(tex,startx+1,pos_y);
    rozn_r=re-tex2D(tex,startx+2,pos_y);
    return (sqrtf(
        __mul24(rozn_b,rozn_b)+
        __mul24(rozn_g,rozn_g)+
        __mul24(rozn_r,rozn_r)));
}

I’m calling it in this way:

sum+=comp<(KERNEL_RADIUS*2)>(startx,a,bl,gr,re,rozn_b,rozn_g,rozn_r);

Using a simple “for” loop in an analogous way gives correct results. Does anyone have any idea what the problem with this code could be?

I use an Intel Core 2 Quad 4x2.4 GHz, 2 GB RAM, a 8600 GTS, Windows XP, CUDA SDK & Toolkit v1.0.

Jakub

Recursion is not supported

But isn’t this similar to the following code from the SDK?

////////////////////////////////////////////////////////////////////////////////
// Loop unrolling templates, needed for best performance
////////////////////////////////////////////////////////////////////////////////
template<int i> __device__ float convolutionRow(float *data){
    return
        data[KERNEL_RADIUS - i] * d_Kernel[i]
        + convolutionRow<i - 1>(data);
}

template<> __device__ float convolutionRow<-1>(float *data){
    return 0;
}

template<int i> __device__ float convolutionColumn(float *data){
    return
        data[(KERNEL_RADIUS - i) * COLUMN_TILE_W] * d_Kernel[i]
        + convolutionColumn<i - 1>(data);
}

template<> __device__ float convolutionColumn<-1>(float *data){
    return 0;
}

In the SDK it works, so what’s the difference? It’s recursion with templates as well…

From page 18 of the programming guide:

I’m not very familiar with this construction in CUDA (I use templates only in Java and C++), but my guess is: the compiler transforms a recursive call to a for-loop. Hence the name: “Loop unrolling templates.” Perhaps in your case, the code is too complex for a rewrite?

In that case, if the recursive call were transformed into a “for” loop, would it make any sense to unroll loops at all? In every case I have seen, loop unrolling leads to a significant performance improvement. But you may be right that the function is too complex. Is there any chance that installing CUDA SDK/Toolkit 1.1 will solve the problem?

nvcc in CUDA 1.1 can automatically unroll for loops now. See Section 4.2.5.2 in the CUDA 1.1 Programming Guide.

Well, this could really help. I’ll need to install v1.1 then.

Thank you both for answering.

Jakub

it’s true that you cannot do recursion. the trick with templates is that you can fake recursion (surely not in all cases). the following piece of code is not recursive at the “function level”; it’s just a few functions (instantiated at compile time) calling each other ;-)

template<int N> __device__ inline int f();

template<> __device__ inline int f<0>()
{
    return 0;
}

template<> __device__ inline int f<1>()
{
    return 1;
}

template<int N> __device__ inline int f()
{
    return f<N-1>() + f<N-2>();
}

__global__ ...
...
    fib = f<7>();

cheers : )

sorry for posting twice.

this is somewhat unrelated to my other post and probably not very useful at all… nevertheless:

i did not find the closing angle bracket corresponding to the one i marked in the code above. this is probably just me being blind or some sort of copy&paste error, but if not … ; )

I must have deleted the closing angle bracket accidentally. It should be:

(...)+
comp<(x-1)>(startx,pos_y,bl,gr,re,rozn_b,rozn_g,rozn_r));

Well… I know that template recursion is really a pack of separate functions calling each other; the problem is why the release version doesn’t behave like the emulation version. I have installed CUDA Toolkit v1.1 and used “#pragma unroll” in the previous version of this code, the one that uses “for” loops. The average times with “#pragma unroll” (which is supposed to always unroll loops) and “#pragma unroll 1” (no unrolling at all) are exactly the same, so apparently the compiler couldn’t unroll it (I assume the speedup from unrolling would be != 0). As written before, the problem could lie in the complexity of this function, but I haven’t found any restrictions about that in the Programming Guide v1.1. Could this be a compiler bug?