I have a requirement to do a 1024-value 1D convolution, so I started from the texture-based convolution sample in the SDK.
The template-based loop unrolling cannot do a 1024-way unroll: the compiler throws "error: excessive recursion at instantiation of function" when instantiating convolutionRow<1024>.
It seems it can only unroll up to around 200 (found by trial and error).
Is there any way to do it, or any other suggestions?
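For reference, the unroller in the SDK sample is a recursive template along these lines (a sketch from memory, not the exact SDK source; texSrc, c_Kernel and KERNEL_RADIUS are declared roughly as in the sample):

    // Each loop iteration becomes one template instantiation, so
    // convolutionRow<1024> needs over a thousand nested instantiations
    // and runs into the compiler's template recursion depth limit.
    #define KERNEL_RADIUS 1024

    texture<float, 2, cudaReadModeElementType> texSrc;    // source image
    __constant__ float c_Kernel[2 * KERNEL_RADIUS + 1];   // filter taps

    template <int i>
    __device__ float convolutionRow(float x, float y)
    {
        return tex2D(texSrc, x + (float)(KERNEL_RADIUS - i), y) * c_Kernel[i]
             + convolutionRow<i - 1>(x, y);   // recurse for the next tap
    }

    template <>                               // recursion terminator
    __device__ float convolutionRow<-1>(float, float)
    {
        return 0.0f;
    }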
Well, yes, the compiler has a limit on template recursion depth.
One solution may be to split the unroll into two nested unrollers:
// assuming C++0x lambdas were available in CUDA...
UnrollerP<8>::step([](int i) {
    UnrollerP<128>::step([i](int j) {   // inner lambda must capture i
        int index = i * 128 + j;
        // great stuff with index
    });
});
There is another problem: kernel size may not exceed 2 KB. With this massive amount of unrolling you might even get a slowdown if the unrolled loop does not fit into the instruction cache.
Normally you don't have to unroll the whole loop anyway; that's why there is the partial Unroller.
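To make the partial Unroller idea concrete: CUDA has no lambdas yet, so the same pattern can be written with a functor. UnrollerP is not something from the SDK; the following is only a sketch of what such a helper could look like (the names Unroll, Body, partialUnrolledLoop and the chunk size are made up for illustration):

    // Unroll<N> expands N calls to the functor at compile time; a runtime
    // outer loop then walks the full range in chunks of N, so only a small,
    // fixed amount of code is generated no matter how long the loop is.
    template <int N>
    struct Unroll
    {
        template <typename F>
        __device__ static void step(F f, int base)
        {
            Unroll<N - 1>::step(f, base);   // emit the first N-1 iterations
            f(base + N - 1);                // emit iteration base + N - 1
        }
    };

    template <>
    struct Unroll<0>
    {
        template <typename F>
        __device__ static void step(F, int) {}   // recursion terminator
    };

    // Functor standing in for the lambda body ("great stuff with index").
    struct Body
    {
        __device__ void operator()(int index) const
        {
            // ... per-element work with index goes here ...
        }
    };

    // Process `total` iterations, unrolling CHUNK of them at a time.
    // Assumes total is a multiple of CHUNK.
    template <int CHUNK>
    __device__ void partialUnrolledLoop(int total)
    {
        Body body;
        for (int base = 0; base < total; base += CHUNK)   // runtime outer loop
            Unroll<CHUNK>::step(body, base);              // compile-time inner unroll
    }

Calling partialUnrolledLoop<8>(1024) unrolls 8 iterations at a time, so the template recursion depth is only 8 and the generated code stays small.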
If you look at the convolution code in the SDK sample, there is only one line of actual work and the rest is just the loop, whose overhead alone roughly doubles the kernel's total instruction count.

for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++)
    sum += tex2D(texSrc, x + (float)k, y) * c_Kernel[KERNEL_RADIUS - k];

This is the loop to be unrolled (KERNEL_RADIUS is 1024).
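For completeness, a lower-effort alternative to the template machinery is to keep the plain loop and ask nvcc to unroll it by a fixed factor with #pragma unroll; whether and how far the compiler honors the hint is up to nvcc, and the factor 8 and the helper name convolveRow below are arbitrary choices, not taken from the SDK sample:

    #define KERNEL_RADIUS 1024

    texture<float, 2, cudaReadModeElementType> texSrc;    // source image texture
    __constant__ float c_Kernel[2 * KERNEL_RADIUS + 1];   // convolution kernel taps

    __device__ float convolveRow(float x, float y)
    {
        float sum = 0.0f;
        // Unroll by a fixed factor instead of fully: this keeps code size
        // bounded while still removing most of the loop overhead.
        #pragma unroll 8
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++)
            sum += tex2D(texSrc, x + (float)k, y) * c_Kernel[KERNEL_RADIUS - k];
        return sum;
    }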