Template metaprogramming

Hi all,

I have a requirement to do a 1024-value 1D convolution, so I started from the texture-based convolution sample in the SDK.

Now the template-based loop unroll cannot do a 1024-way unroll: the compiler throws "error: excessive recursion at instantiation of function" when instantiating convolutionRow< 1024 >.

It seems it can only unroll up to around 200 (found out by trial and error).

Is there any way to do it, or any other suggestions?

See this: http://www.codeproject.com/KB/cpp/loopunrolling.aspx

Basically, you define your loop body in a functor with a

void operator()(int)

and then, ideally, use the

UnrollerP

(this is rewritten from my original article, because it is easier to follow):

//UnrollerP: loops over a given size, partially unrolled
template<size_t InnerUnroll = 8>
struct UnrollerP {
public:
	template<typename Lambda>
	static void step(size_t Begin, size_t N, Lambda& func) {
		size_t i = Begin;
		//i + InnerUnroll <= N avoids size_t underflow when N < InnerUnroll
		for (; i + InnerUnroll <= N; i += InnerUnroll) {
			UnrollerInternal<0, InnerUnroll>::step(func, i);
		}
		for (; i < N; ++i) {
			func(i);
		}
	}

private:
	//start of UnrollerInternal
	template<size_t Offset, size_t End>
	struct UnrollerInternal {
		template<typename Lambda>
		static void step(Lambda& func, size_t i) {
			func(i + Offset);
			UnrollerInternal<Offset + 1, End>::step(func, i);
		}
	};
	//end of UnrollerInternal: the terminating case must be a partial
	//specialization; an explicit specialization inside the class template
	//is not standard C++
	template<size_t End>
	struct UnrollerInternal<End, End> {
		template<typename Lambda>
		static void step(Lambda&, size_t) {
		}
	};
};

struct Functor {
	void operator()(int i) {
		//great stuff
	}
};

//in kernel code:
Functor myFunc;
UnrollerP<200>::step(0, total, myFunc);

Ah, sorry… you are already using templates.

Well, yes, the compiler has a limit on template recursion depth.

The solution may be to unroll your loop as a nested loop:

//assuming there were C++0x's lambdas available in CUDA...
UnrollerP<8>::step(0, 8, [](size_t i) {
  UnrollerP<128>::step(0, 128, [i](size_t j) {   //inner lambda must capture i
	size_t index = i * 128 + j;
	//great stuff with index
  });
});

There is another problem: kernel code size is limited, and with this massive amount of unrolling you might even get a slowdown if the unrolled loop no longer fits into the instruction cache.

Normally you don't have to unroll the whole loop. That's why there is the partial Unroller.

What is wrong with using #pragma unroll?

Thanks…

If we look at the convolution code in the SDK sample, there is only one line of actual code and the rest is loop overhead, which almost doubles the kernel's total instruction count.

for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++)
	 sum += tex2D(texSrc, x + (float)k, y) * c_Kernel[KERNEL_RADIUS - k];

This is the loop to be unrolled (KERNEL_RADIUS is 1024).

It generates another compiler warning. “too much code expansion. loop not unrolled”

It does not work with template parameters:

template <int M>
__global__ void kernel(...) {
  #pragma unroll M
  for (int i = 0; i < size; ++i) {
	//stuff
  }
}
