Loop unrolling

I tried to unroll some loops to see if I could improve the performance of two nested for-loops. I got these compiler warnings:

warning, loop was not unrolled, inline assembly
warning, loop was not unrolled, not innermost loop

What does the first warning mean?

Is it not possible to unroll nested for-loops?

How can I unroll a loop if I don’t know the number of iterations at compile time? Can I write some kind of template so that a kernel is selected at run time? Will I get a very big executable if I instantiate something like 100 templates?

It would probably help if you could paste the for loop you are trying to unroll…

Sorry, it’s just an ordinary 2D convolution:

		x_offset = -(FILTER_W - 1)/2;
		for (int filter_x = FILTER_W - 1; filter_x >= 0; filter_x--)
		{
			y_offset = -(FILTER_H - 1)/2;
			for (int filter_y = FILTER_H - 1; filter_y >= 0; filter_y--)
			{
				// Set pixels outside the image to 0
				if ( (x + x_offset >= 0) && (x + x_offset < DATA_W) && (y + y_offset >= 0) && (y + y_offset < DATA_H) )
				{
					sum += filter_shared[filter_x + filter_y * FILTER_W] * tex2D(tex_Image, x + x_offset + 0.5f, y + y_offset + 0.5f);
				}
				// Step to the next image row under the filter
				y_offset++;
			}
			// Step to the next image column under the filter
			x_offset++;
		}
Hm… I’ve never seen ‘#pragma unroll’ applied to nested for-loops. Your second warning says the compiler only unrolls the innermost loop, which seems logical: an outer loop carries another loop plus per-iteration setup (in your case the y_offset reset), so it is much harder to unroll safely. So unrolling is probably only supported for the innermost loop.
You can test this by putting the following statement just before the start of the innermost for loop:
#pragma unroll 3
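
On the template question, here is a minimal sketch of how it could work, not a tested kernel. It is written as plain host C++ so it can be run anywhere, and it reads the image from an array instead of a texture. The names convolve_pixel and convolve_pixel_dispatch are made up for illustration. The filter size is a template parameter, so the trip counts are known at compile time, and an ordinary switch picks the instantiation at run time. The same pattern should carry over to a __global__ kernel.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical helper: convolve one output pixel with a filter whose size
// is a compile-time constant, so both loops have known trip counts.
template <int FILTER_W, int FILTER_H>
float convolve_pixel(const std::vector<float>& image, int data_w, int data_h,
                     const float* filter, int x, int y)
{
    float sum = 0.0f;
    int x_offset = -(FILTER_W - 1) / 2;
    for (int filter_x = FILTER_W - 1; filter_x >= 0; filter_x--)
    {
        int y_offset = -(FILTER_H - 1) / 2;
        // The pragma is only emitted when compiling with nvcc; host
        // compilers may not recognize it.
#if defined(__CUDACC__)
#pragma unroll
#endif
        for (int filter_y = FILTER_H - 1; filter_y >= 0; filter_y--)
        {
            const int px = x + x_offset;
            const int py = y + y_offset;
            // Pixels outside the image contribute 0
            if (px >= 0 && px < data_w && py >= 0 && py < data_h)
            {
                sum += filter[filter_x + filter_y * FILTER_W] * image[px + py * data_w];
            }
            y_offset++;
        }
        x_offset++;
    }
    return sum;
}

// Run-time selection: each case refers to a separate template instantiation.
// Only the sizes listed here are compiled into the binary.
float convolve_pixel_dispatch(int filter_size, const std::vector<float>& image,
                              int data_w, int data_h, const float* filter,
                              int x, int y)
{
    switch (filter_size)
    {
        case 3: return convolve_pixel<3, 3>(image, data_w, data_h, filter, x, y);
        case 5: return convolve_pixel<5, 5>(image, data_w, data_h, filter, x, y);
        case 7: return convolve_pixel<7, 7>(image, data_w, data_h, filter, x, y);
        default: throw std::invalid_argument("unsupported filter size");
    }
}
```

Calling convolve_pixel_dispatch(3, image, DATA_W, DATA_H, filter, x, y) would run the 3x3 version. Every instantiation is compiled as its own copy of the code, so the executable grows with the number of sizes you list. With 100 of them it will get noticeably bigger, but only the selected one runs.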