Description:
I have a problem with an #pragma unroll directive. The compiler refuses to unroll the loop with the message “Advisory: Loop was not unrolled, inline assembly”. The loop is inside another loop. It seems that “inline assembly” refers to three tex2D() functions in the for loop. If I remove those functions the loop gets unrolled.
The code is based on the renderVolume example in the SDK. It has been changed to front-to-back traversing. Inside the raycasting loop there is a small loop which should do a computation with a fixed number of stars. This inner loop should get unrolled but it doesn’t. I also tried to use #pragma unroll with or without number, and with a real number instead of the #define value but it did not help.
The performance lack because of this is not neglectable. I tried to unroll the loop by myself using either repeated calls to a device function or to a macro function (which both represent the loop body). Both result in a kernel which is much faster.
Do you have any idea why the loop does not get unrolled?
shortend code:
#include <cutil_inline.h>
#include <cutil_math.h>
#define STARS_NUMBER_MAX 2
texture<float4, 3, cudaReadModeElementType> tex_blah;
texture<float4, 3, cudaReadModeElementType> tex_blub;
texture<float1, 2, cudaReadModeElementType> tex_table;
// [...]
__global__ void
d_render_RNV(
uint *d_output, uint imageW, uint imageH, float aspectW, float aspectH,
float albedo, float3 colorF, float tstep
)
{
// [...]
if (!hit) return;
if (tnear < 0.0f) tnear = 0.0f; // clamp to near plane
// [...]
for(uint rc=0; rc<RNV_RAY_CASTING_STEP_NUMBER_MAX; ++rc) {
// [...]
float4 blah = tex3D(tex_blah, pos.x, pos.y, pos.z);
// [...]
float4 blub = tex3D(tex_blub, pos.x, pos.y, pos.z);
// Compiler says: "Advisory: Loop was not unrolled, inline assembly"
// unroll ignored for unknown reason because of access to tex2D ???
#pragma unroll STARS_NUMBER_MAX
for(uint cur_star = 0; cur_star < STARS_NUMBER_MAX; ++cur_star) {
// [...]
float3 value;
value.x = tex2D(tex_table, index1, index2.x).x;
value.y = tex2D(tex_table, index1, index2.y).x;
value.z = tex2D(tex_table, index1, index2.z).x;
// [...]
}
// [...]
t += tstep;
if (t > tfar) break;
// [...]
}
// [...]
if ((x < imageW) && (y < imageH)) {
// write output color
uint i = __umul24(y, imageW) + x;
d_output[i] = rgbFloatToInt(sum);
}
}
// [...]
Note: The […] sections only contain local variables, constant memory, and normal computations, including a few device functions. There are no other texND() functions, no access to global or shared memory, no sync functions, no other loops or conditional statements (except for those in the cuda standard functions).
Build output:
1>------ Build started: Project: DRNV, Configuration: Release x64 ------
1>Compiling with CUDA Build Rule...
1>"C:\CUDA\bin64\nvcc.exe" -arch sm_10 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -I"C:\CUDA\include" -I"C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\inc" -maxrregcount=32 --ptxas-options=-v --compile -o "x64\Release\DRNV_kernel.cu.obj" "h:\MeinCode\DRNV 09 single\DRNV_kernel.cu"
1>DRNV_kernel.cu
1>tmpxft_000011cc_00000000-3_DRNV_kernel.cudafe1.gpu
1>tmpxft_000011cc_00000000-8_DRNV_kernel.cudafe2.gpu
1>h:/MeinCode/DRNV 09 single/DRNV_kernel.cu(562): Advisory: Loop was not unrolled, inline assembly <---------------------------------------- That's bad!
1>ptxas info : Compiling entry function '_Z30d_RNV_precompute_star_emission14cudaPitchedPtrj5uint
3'
1>ptxas info : Used 13 registers, 48+16 bytes smem, 140 bytes cmem[0], 44 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z35d_RNV_precompute_star_illuminations14cudaPitchedPtrj
5uint3ff'
1>ptxas info : Used 17 registers, 56+16 bytes smem, 140 bytes cmem[0], 16 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z33d_RNV_precompute_downsample_od_em14cudaPitchedPtrj5u
int3'
1>ptxas info : Used 10 registers, 48+16 bytes smem, 140 bytes cmem[0], 12 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z39d_render_RNV_mr_difference_highest_ver2PjjjP6float4j
jj'
1>ptxas info : Used 24 registers, 36+16 bytes smem, 140 bytes cmem[0], 8 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z31d_render_RNV_mr_difference_ver2P6float4jjjS0_jjj'
1>ptxas info : Used 27 registers, 44+16 bytes smem, 140 bytes cmem[0], 8 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z34d_render_RNV_mr_difference_highestPjjj'
1>ptxas info : Used 9 registers, 16+16 bytes smem, 140 bytes cmem[0], 4 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z26d_render_RNV_mr_differenceP6float4jjj'
1>ptxas info : Used 15 registers, 20+16 bytes smem, 140 bytes cmem[0], 4 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z33d_render_RNV_mr_difference_lowestP6float4jjj'
1>ptxas info : Used 12 registers, 20+16 bytes smem, 140 bytes cmem[0], 4 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z19d_render_display_mrPjjP6float4jjj'
1>ptxas info : Used 7 registers, 36+16 bytes smem, 140 bytes cmem[0], 4 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z29d_render_RNV_mr_unrolled_funcP6float4jfffff6float3f'
1>ptxas info : Used 32 registers, 48+16 bytes smem, 140 bytes cmem[0], 44 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z29d_render_RNV_nc_unrolled_funcPjjjfff6float3fS0_'
1>ptxas info : Used 32 registers, 56+16 bytes smem, 140 bytes cmem[0], 48 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z26d_render_RNV_unrolled_funcPjjjfff6float3f'
1>ptxas info : Used 32 registers, 44+16 bytes smem, 140 bytes cmem[0], 48 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z27d_render_RNV_unrolled_macroPjjjfff6float3f'
1>ptxas info : Used 32 registers, 12+0 bytes lmem, 44+16 bytes smem, 140 bytes cmem[0], 48 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z12d_render_RNVPjjjfff6float3f' <---------------------------------------- That's the one!
1>ptxas info : Used 32 registers, 20+0 bytes lmem, 44+16 bytes smem, 140 bytes cmem[0], 52 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z22d_render_front_to_backPjjjff6float46float2ffff'
1>ptxas info : Used 23 registers, 72+16 bytes smem, 140 bytes cmem[0], 20 bytes cmem[1]
1>ptxas info : Compiling entry function '_Z22d_render_back_to_frontPjjjff6float46float2ffff'
1>ptxas info : Used 23 registers, 72+16 bytes smem, 140 bytes cmem[0], 20 bytes cmem[1]
1>tmpxft_000011cc_00000000-3_DRNV_kernel.cudafe1.cpp
1>tmpxft_000011cc_00000000-12_DRNV_kernel.ii
1>Build log was saved at "file://h:\MeinCode\DRNV 09 single\x64\Release\BuildLog.htm"
1>DRNV - 0 error(s), 0 warning(s)
Note: DRNV_kernel.cu(562) refers to the line with the for-loop after the #pragma in the code above.
Solution and project files are copied from the volumeRenderer example.
System:
-
Windows Vista Buisness (64 bit) or Windows XP (32 bit)
-
8800 GTX or 260 GTX
-
Visual Studio 2008
-
Driver 190.38
-
CUDA Toolkit 2.3
-
CUDA SDK: CUDA SDK 2.3