#pragma unroll

Hello,

I have a problem with #pragma unroll in this part of my code:

The compiler returns a warning:

But why?

I use CUDA 3.1 and a GTX 480.

I noticed that too. Loops that were previously unrolled (in 2.3) are no longer unrolled?

template <int m_StrideDiv4>
__device__ void function(…)

Thank you!

But I have a compiler error:

Why?

You probably aren’t calling the function with the right arguments. Also make sure your header file is updated according to your recent change.

const int m_StrideDiv4 = 6;
function<m_StrideDiv4>(…);

The problem is that m_StrideDiv4 (>= 6) is calculated on the host and is only constant within this function.
Therefore, I cannot define it as a constant outside of the function.
What can I do to solve this problem?
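One possible way around a trip count that is only known at run time, sketched here under assumptions, is to instantiate the template for a handful of expected values and select one with a switch. The names updateFixed, updateDynamic, update and f_stub are hypothetical, as are the chosen cases 6, 8 and 12; f_stub only stands in for the thread's f(), which is not shown. The fallback macros exist only so the sketch also builds with a plain C++ compiler.

```cpp
// Fallback so the sketch compiles without nvcc; under nvcc these are built in.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical stand-in for f(); the real decoder is not part of this sketch.
__host__ __device__ inline float f_stub(int i) { return 1.0f + static_cast<float>(i); }

// N is a template parameter, so the trip count is a compile-time constant.
template <int N>
__host__ __device__ void updateFixed(float* v, const float* s, int v2)
{
#ifdef __CUDACC__
#pragma unroll
#endif
    for (int i = 0; i < N; ++i) {
        v[v2 + i] -= f_stub(i) * s[i];
    }
}

// Generic fallback: the trip count is a runtime value, so the loop stays rolled.
__host__ __device__ inline void updateDynamic(float* v, const float* s, int v2, int n)
{
    for (int i = 0; i < n; ++i) {
        v[v2 + i] -= f_stub(i) * s[i];
    }
}

// Maps the runtime value onto the instantiations that exist (assumed set: 6, 8, 12).
__host__ __device__ inline void update(float* v, const float* s, int v2, int strideDiv4)
{
    switch (strideDiv4) {
    case 6:  updateFixed<6>(v, s, v2);  break;
    case 8:  updateFixed<8>(v, s, v2);  break;
    case 12: updateFixed<12>(v, s, v2); break;
    default: updateDynamic(v, s, v2, strideDiv4); break;
    }
}
```

The same switch could just as well sit on the host around a templated kernel launch. Each extra case adds another copy of the loop body to the binary, so this only pays off for a small set of likely values.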

What does the function f look like?

The function f is an arithmetic decoder…
Without f, the “#pragma unroll” succeeds, but why?

OK, so you know that m_StrideDiv4 >= 6, right?

Maybe:

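__device__ void function(..., const int m_StrideDiv4 /* m_StrideDiv4 >= 6 */, ...) {
    // The first 6 iterations have a compile-time trip count, so they can be unrolled.
    #pragma unroll
    for (int i = 0; i < 6; ++i)
    {
        float deltaA = f(...) * s[i]; // f is a device function
        v[v2 + i] -= deltaA;
    }

    // The remaining iterations depend on the runtime value and stay rolled.
    for (int i = 6; i < m_StrideDiv4; ++i)
    {
        float deltaA = f(...) * s[i]; // f is a device function
        v[v2 + i] -= deltaA;
    }
}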

Would that solve your problem?

EDIT:

OK, I missed this post.

The reason is that f(…) takes in some arguments that are not known at compile time.

Make sure f() has exactly one return statement.
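As a rough illustration of what a single return statement means in practice, the sketch below rewrites a small function with two exits into one with a single exit. Both functions are hypothetical and unrelated to the actual decoder; whether this rewrite changes what the compiler unrolls depends on the compiler version and is not guaranteed.

```cpp
// Fallback so the sketch compiles without nvcc; under nvcc these are built in.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper with two return statements.
__host__ __device__ inline unsigned int clampEarlyReturn(unsigned int x, unsigned int limit)
{
    if (x > limit) {
        return limit;   // first exit
    }
    return x;           // second exit
}

// Same behaviour with exactly one return statement: the result is
// accumulated in a local variable and returned at the end.
__host__ __device__ inline unsigned int clampSingleReturn(unsigned int x, unsigned int limit)
{
    unsigned int result = x;
    if (x > limit) {
        result = limit;
    }
    return result;
}
```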

OK, I did that, but now I have a new error.

Do you use any pointers within f()?

Yes, I use a pointer in f.
f is a decoder and it takes a pointer to the coded data.

Try marking them as __restrict__ (see appendix E.3 of the Programming Guide) to indicate to the compiler that they don’t interfere with the loop counter.

I’m not sure this is the problem, though. The compiler should still be able to note that the address of i is never taken.
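For reference, marking pointers as restricted looks roughly like the sketch below. The function scaleSubtract and its parameters are hypothetical. The __restrict__ spelling is accepted by nvcc, GCC and Clang; other compilers may spell it differently. The qualifier is a promise by the programmer that the two pointers never refer to overlapping memory, and breaking that promise is undefined behaviour.

```cpp
// Fallback so the sketch compiles without nvcc; under nvcc these are built in.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: v and s are promised not to alias each other,
// so stores through v cannot change what is later read through s.
__host__ __device__ inline void scaleSubtract(float* __restrict__ v,
                                              const float* __restrict__ s,
                                              int n, float scale)
{
    for (int i = 0; i < n; ++i) {
        v[i] -= scale * s[i];
    }
}
```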

Another question is, if f() is an expensive function, why would you want the loop to be unrolled?

__restrict__ was not the problem…

Aren’t unrolled loops better for performance?

I don’t understand it. f() is independent of m_StrideDiv4, so why is f() a problem for the compiler?

You save an increment and a branch (and potentially a comparison), i.e. two or three instructions per iteration. That’s significant if the loop body itself has only one or a few instructions, but the benefit quickly diminishes as the loop body gets larger.

Unrolling might open up possibilities for other optimizations, but that does not seem to be the case here.
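To make the saved instructions concrete, here is a small hand-written comparison. Both functions are hypothetical and compute the same sum over four elements: the rolled version pays for an increment, a comparison and a branch on every iteration, while the unrolled version has none of that bookkeeping. With a body this small the overhead is a large share of the work; with a body the size of a decoder call it would be a small share.

```cpp
// Fallback so the sketch compiles without nvcc; under nvcc these are built in.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Rolled form: each iteration also executes ++i, i < 4 and a branch.
__host__ __device__ inline float sumRolled(const float* a)
{
    float acc = 0.0f;
    for (int i = 0; i < 4; ++i) {
        acc += a[i];
    }
    return acc;
}

// What full unrolling amounts to: the same additions in the same order,
// with no loop counter, comparison or branch left.
__host__ __device__ inline float sumUnrolled(const float* a)
{
    float acc = 0.0f;
    acc += a[0];
    acc += a[1];
    acc += a[2];
    acc += a[3];
    return acc;
}
```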

The compiler has to make sure to produce code that is equivalent under any circumstances. That requires a lot of analysis. Can you post the code of f()?

The problem probably is not related to m_StrideDiv4 at all.

Maybe you could show us at least a code snippet.

That’s not a problem - as long as f() has no other constructs preventing unrolling, the compiler will happily inline f() and still unroll the loop.

The code of f() is very large, with calls to other functions and loops, so here is a simplified representation of f. Simple instructions are replaced with “…”:

__device__ unsigned int f(pointer of struct)
{
	...
	decode(pointer of struct);
	...
	decode(pointer of struct);
	...
	decode(pointer of struct);
	...
	return ...;
}

__device__ unsigned int decode(pointer of struct)
{
	...
	for(int i=0; i<8; ++i)
	{
		...
	}
	...
	if (...){
		do {
			...
		} while (...);
	}
	return ...;
}

If those parameters affect the addressing, it most definitely should. But that doesn’t have to be the actual reason :)

if (...){  // <------- Problem?
	do {
		...
	} while (...);  // <-------- Problem?
}

I think those conditionals will be a problem if they depend on dynamic variables, which means the compiler doesn’t know which path to take.
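To illustrate the difference between a data-dependent loop and one with a fixed trip count, here is a hypothetical renormalization step of the kind an arithmetic decoder might contain. It is an assumption made for illustration, not the code from this thread. The first version loops until a condition on the data holds, so its trip count is unknown at compile time. The second always runs 31 iterations and guards the work with a condition, which gives the compiler a constant trip count to work with. Whether it then unrolls is still up to the compiler.

```cpp
// Fallback so the sketch compiles without nvcc; under nvcc these are built in.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Data-dependent form: shifts left until the top bit is set.
// The number of iterations depends on the value of range.
__host__ __device__ inline unsigned int normalizeDynamic(unsigned int range)
{
    if (range != 0u && (range & 0x80000000u) == 0u) {
        do {
            range <<= 1;
        } while ((range & 0x80000000u) == 0u);
    }
    return range;
}

// Fixed-trip-count form: a nonzero 32-bit value needs at most 31 shifts,
// so the loop always runs 31 times and the shift is applied conditionally.
__host__ __device__ inline unsigned int normalizeBounded(unsigned int range)
{
    for (int k = 0; k < 31; ++k) {
        if (range != 0u && (range & 0x80000000u) == 0u) {
            range <<= 1;
        }
    }
    return range;
}
```

The bounded form trades wasted iterations for predictability, so it is only attractive when the typical number of shifts is close to the maximum or the loop body is cheap.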