#pragma unroll get's ignored because of texture calls. Why? #pragma unroll causes Advisory: Loop

Description:

I have a problem with an #pragma unroll directive. The compiler refuses to unroll the loop with the message “Advisory: Loop was not unrolled, inline assembly”. The loop is inside another loop. It seems that “inline assembly” refers to three tex2D() functions in the for loop. If I remove those functions the loop gets unrolled.

The code is based on the renderVolume example in the SDK. It has been changed to front-to-back traversing. Inside the raycasting loop there is a small loop which should do a computation with a fixed number of stars. This inner loop should get unrolled but it doesn’t. I also tried to use #pragma unroll with or without number, and with a real number instead of the #define value but it did not help.

The performance lack because of this is not neglectable. I tried to unroll the loop by myself using either repeated calls to a device function or to a macro function (which both represent the loop body). Both result in a kernel which is much faster.

Do you have any idea why the loop does not get unrolled?

shortend code:

#include <cutil_inline.h>

#include <cutil_math.h>

#define STARS_NUMBER_MAX 2

texture<float4, 3, cudaReadModeElementType> tex_blah;

texture<float4, 3, cudaReadModeElementType> tex_blub;

texture<float1, 2, cudaReadModeElementType> tex_table;

// [...]

__global__ void 

d_render_RNV(

  uint *d_output, uint imageW, uint imageH, float aspectW, float aspectH,

  float albedo, float3 colorF, float tstep

)

{

  // [...]

  if (!hit) return;

  if (tnear < 0.0f) tnear = 0.0f;	 // clamp to near plane

  // [...]

  for(uint rc=0; rc<RNV_RAY_CASTING_STEP_NUMBER_MAX; ++rc) {		

	// [...]

	float4 blah = tex3D(tex_blah, pos.x, pos.y, pos.z);

	// [...]

	float4 blub = tex3D(tex_blub, pos.x, pos.y, pos.z);

	// Compiler says: "Advisory: Loop was not unrolled, inline assembly"

	// unroll ignored for unknown reason because of access to tex2D ???

	#pragma unroll STARS_NUMBER_MAX

	for(uint cur_star = 0; cur_star < STARS_NUMBER_MAX; ++cur_star) {

	  // [...]

	  float3 value; 

	  value.x = tex2D(tex_table, index1, index2.x).x;

	  value.y = tex2D(tex_table, index1, index2.y).x;

	  value.z = tex2D(tex_table, index1, index2.z).x;

	  // [...]

	}

	

	// [...]

	t += tstep;

	if (t > tfar) break;

	// [...]

  }

  // [...]

  if ((x < imageW) && (y < imageH)) {

	// write output color

	uint i = __umul24(y, imageW) + x;

	d_output[i] = rgbFloatToInt(sum);

  }

}

// [...]

Note: The […] sections only contain local variables, constant memory, and normal computations, including a few device functions. There are no other texND() functions, no access to global or shared memory, no sync functions, no other loops or conditional statements (except for those in the cuda standard functions).

Build output:

1>------ Build started: Project: DRNV, Configuration: Release x64 ------

1>Compiling with CUDA Build Rule...

1>"C:\CUDA\bin64\nvcc.exe"	-arch sm_10 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"	-Xcompiler "/EHsc /W3 /nologo /O2 /Zi   /MT  " -I"C:\CUDA\include" -I"C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\inc" -maxrregcount=32 --ptxas-options=-v --compile -o "x64\Release\DRNV_kernel.cu.obj" "h:\MeinCode\DRNV 09 single\DRNV_kernel.cu" 

1>DRNV_kernel.cu

1>tmpxft_000011cc_00000000-3_DRNV_kernel.cudafe1.gpu

1>tmpxft_000011cc_00000000-8_DRNV_kernel.cudafe2.gpu

1>h:/MeinCode/DRNV 09 single/DRNV_kernel.cu(562): Advisory: Loop was not unrolled, inline assembly <----------------------------------------  That's bad!

1>ptxas info	: Compiling entry function '_Z30d_RNV_precompute_star_emission14cudaPitchedPtrj5uint

3'

1>ptxas info	: Used 13 registers, 48+16 bytes smem, 140 bytes cmem[0], 44 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z35d_RNV_precompute_star_illuminations14cudaPitchedPtrj

5uint3ff'

1>ptxas info	: Used 17 registers, 56+16 bytes smem, 140 bytes cmem[0], 16 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z33d_RNV_precompute_downsample_od_em14cudaPitchedPtrj5u

int3'

1>ptxas info	: Used 10 registers, 48+16 bytes smem, 140 bytes cmem[0], 12 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z39d_render_RNV_mr_difference_highest_ver2PjjjP6float4j

jj'

1>ptxas info	: Used 24 registers, 36+16 bytes smem, 140 bytes cmem[0], 8 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z31d_render_RNV_mr_difference_ver2P6float4jjjS0_jjj'

1>ptxas info	: Used 27 registers, 44+16 bytes smem, 140 bytes cmem[0], 8 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z34d_render_RNV_mr_difference_highestPjjj'

1>ptxas info	: Used 9 registers, 16+16 bytes smem, 140 bytes cmem[0], 4 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z26d_render_RNV_mr_differenceP6float4jjj'

1>ptxas info	: Used 15 registers, 20+16 bytes smem, 140 bytes cmem[0], 4 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z33d_render_RNV_mr_difference_lowestP6float4jjj'

1>ptxas info	: Used 12 registers, 20+16 bytes smem, 140 bytes cmem[0], 4 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z19d_render_display_mrPjjP6float4jjj'

1>ptxas info	: Used 7 registers, 36+16 bytes smem, 140 bytes cmem[0], 4 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z29d_render_RNV_mr_unrolled_funcP6float4jfffff6float3f'

1>ptxas info	: Used 32 registers, 48+16 bytes smem, 140 bytes cmem[0], 44 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z29d_render_RNV_nc_unrolled_funcPjjjfff6float3fS0_'

1>ptxas info	: Used 32 registers, 56+16 bytes smem, 140 bytes cmem[0], 48 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z26d_render_RNV_unrolled_funcPjjjfff6float3f'

1>ptxas info	: Used 32 registers, 44+16 bytes smem, 140 bytes cmem[0], 48 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z27d_render_RNV_unrolled_macroPjjjfff6float3f'

1>ptxas info	: Used 32 registers, 12+0 bytes lmem, 44+16 bytes smem, 140 bytes cmem[0], 48 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z12d_render_RNVPjjjfff6float3f'	 <---------------------------------------- That's the one!

1>ptxas info	: Used 32 registers, 20+0 bytes lmem, 44+16 bytes smem, 140 bytes cmem[0], 52 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z22d_render_front_to_backPjjjff6float46float2ffff'

1>ptxas info	: Used 23 registers, 72+16 bytes smem, 140 bytes cmem[0], 20 bytes cmem[1]

1>ptxas info	: Compiling entry function '_Z22d_render_back_to_frontPjjjff6float46float2ffff'

1>ptxas info	: Used 23 registers, 72+16 bytes smem, 140 bytes cmem[0], 20 bytes cmem[1]

1>tmpxft_000011cc_00000000-3_DRNV_kernel.cudafe1.cpp

1>tmpxft_000011cc_00000000-12_DRNV_kernel.ii

1>Build log was saved at "file://h:\MeinCode\DRNV 09 single\x64\Release\BuildLog.htm"

1>DRNV - 0 error(s), 0 warning(s)

Note: DRNV_kernel.cu(562) refers to the line with the for-loop after the #pragma in the code above.

Solution and project files are copied from the volumeRenderer example.

System:

  • Windows Vista Buisness (64 bit) or Windows XP (32 bit)

  • 8800 GTX or 260 GTX

  • Visual Studio 2008

  • Driver 190.38

  • CUDA Toolkit 2.3

  • CUDA SDK: CUDA SDK 2.3

If anyone knows… I would also like to know the answer

i get the same thing: Advisory: Loop was not unrolled, inline assembly

what does inline assembly mean?

here’s an easy to read program who’s only purpose is to trigger the problem

#include <stdio.h>

#include <cuda.h>

texture<int, 1, cudaReadModeElementType> texRef;

#define TEST_ITERATIONS 1000

__global__ void t_kernel( )

{

	volatile int r;

	int tid = threadIdx.x, i;

		

	#pragma unroll 10

	for ( i = 0; i < TEST_ITERATIONS; i ++ )

		r = tex1Dfetch(texRef, tid);

}

int main ()

{	

	int num_elements, data_size;

	int *d_data;

	

	num_elements = 256;

	data_size	= num_elements*sizeof(int);	

	// allocate array on device

	cudaMalloc((void **) &d_data,   data_size);

	

	// bind texture

	cudaBindTexture(0, texRef, d_data, data_size);

	

	// kernel invocation

	t_kernel <<< 1, 256 >>> ( );	

	// free memory

	cudaFree(d_data);

}

I preface this post by saying it is just a guess.

At present, all functions are inlined by the compiler. This will include the texture fetch calls. Normally this shouldn’t be all that problematic for the compiler because functions themselves are, for the most part, written in C. If one of the optimizer phases is asked to unroll a loop containing a function call, it basically means doing some sort of recursive expansion of the function C code, and parsing out the loop variable in the inlined function.

My understanding is that when you execute a texture fetch in a kernel, what is effectively happening is that your kernel (which is a shader thread) is using some “black magic” to fork a texture unit thread to retrieve data. It probably shouldn’t be too much of a surprise that the code to do that is some sort of assembly language hook into something pretty close to the silicon. The C compiler doesn’t know much, if anything at all, about assembly language. Rarely enough to work out how to find and parse out a loop variable inside a block of assembly code it didn’t generate. Hence the behaviour you are seeing.

thanks for the reply… so it’s not possible to unroll loops with texture fetches?

cuz i just wanted to get an accurate measurement of the texture bandwidth for various data sizes & types… :(

anyway i played with the noinline qualifier and got a new message: Advisory: Loop was not unrolled, unexpected call OPs

(but if the function doesn’t understand the texture fetch code, i don’t think the noinline would solve anything)

#include <stdio.h>

#include <cuda.h>

texture<int, 1, cudaReadModeElementType> texRef;

#define TEST_ITERATIONS 1000	

__noinline__ __device__ void tfetch()

{

	volatile int r;

	r = tex1Dfetch(texRef, threadIdx.x);		

}

__noinline__ __device__ void tloop()

{	

	#pragma unroll 10

	for ( int i = 0; i < TEST_ITERATIONS; i ++ )

		tfetch();

}

__global__ void t_kernel( )

{

	tloop();

}

int main ()

{	

	int num_elements, data_size;

	int *d_data;

	num_elements = 256;

	data_size	= num_elements*sizeof(int);

	

	// allocate array on device

	cudaMalloc((void **) &d_data,   data_size);

	

	// bind texture

	cudaBindTexture(0, texRef, d_data, data_size);

	

	// kernel invocation

	t_kernel <<< 1, 256 >>> ( );	

	// free memory

	cudaFree(d_data);

}