nvopencc internal error: out of float registers

Hello,

I ran into the following error:

nvcc -O3 -use_fast_math --prec-sqrt=false --prec-div=false --keep --keep-dir keep -L/home/tener/localopt/NVIDIA_GPU_Computing_SDK/C/lib -lcurand -lcutil_x86_64 -lparamgl_x86_64 -lglut -lGL -lglfw -lGLEW -lpthread -lboost_system -lboost_thread-mt -gencode arch=compute_20,code=sm_21 -I/home/tener/localopt/NVIDIA_GPU_Computing_SDK/C/common/inc -I/home/tener/localopt/NVIDIA_GPU_Computing_SDK/shared/inc/ -I. --compiler-options -mtune=native,-march=native,-O3   --ptxas-options '--verbose -O4' *.cu obj/graphics.o obj/main.o obj/server.o obj/utils.o -o rt

keep/kernel.cpp3.i(0): Warning: Olimit was exceeded on function _ZN8RayTraceI7SurfaceIL4Surf0E6float3fE12ModelViewRayIS2_fEEclEi; will not perform function-scope optimization.

	To still perform function-scope optimization, use -OPT:Olimit=0 (no limit) or -OPT:Olimit=47393

### Assertion failure at line 2761 of ../../be/cg/NVISA/cgtarget.cxx:

### Compiler Error in file keep/kernel.cpp3.i during Register Allocation phase:

### ran out of registers in float

nvopencc INTERNAL ERROR: /home/tener/localopt/cuda/open64/lib//be returned non-zero status 1

Does this mean I have to decrease the complexity of my kernel, or is it indeed an internal error that I should report?

-- edit --

Adding "--opencc-options -OPT:Olimit=0" to the nvcc options makes the error go away.

The assertion failure you reported is an internal compiler error we should look into. The workaround you applied avoids the underlying problem, so I suggest you continue on that basis for the time being.

If you are a registered developer, please file a bug against the compiler, attaching the .cpp3.i intermediate file (retained by compiling with --keep). If you are not a registered developer, please send me the .cpp3.i that causes this failure so I can follow up with the compiler team. You can attach the file to a personal message sent through the forums. Thank you for your help.

The code for my application is available on GitHub. The version that triggers the bug is from this commit:

https://github.com/Tener/cuda-course/commit/644d1298380f04fb7a9ff1113ee9a3c3140b4536

I also attach the relevant .cpp3.i file.

I think I can tell what is causing the bug. The file I attached contains plenty of lines like this one: nvopencc bug · GitHub (roughly 3k lines if you break it every ~80 characters).

This is nothing but the inlined code for the Chebyshev polynomial calculation (N=16). The code is very repetitive and gets optimized into some pretty .ptx code:

...
	mul.ftz.f32 	%f53, %f42, %f48;
	sub.ftz.f32 	%f54, %f53, %f41;
	mul.ftz.f32 	%f55, %f44, %f50;
	sub.ftz.f32 	%f56, %f55, %f43;
	mul.ftz.f32 	%f57, %f46, %f52;
	sub.ftz.f32 	%f58, %f57, %f45;
	mul.ftz.f32 	%f59, %f42, %f54;
	sub.ftz.f32 	%f60, %f59, %f48;
	mul.ftz.f32 	%f61, %f44, %f56;
	sub.ftz.f32 	%f62, %f61, %f50;
	mul.ftz.f32 	%f63, %f46, %f58;
	sub.ftz.f32 	%f64, %f63, %f52;
	mul.ftz.f32 	%f65, %f42, %f60;
...

This is exactly what I wanted to have when I wrote this code using templates:

// General case: T_N(x) = 2*x*T_{N-1}(x) - T_{N-2}(x)
template <int N>
struct Chebyshev_T
{
  __host__ __device__
  static float calculate(float x)
  {
    return 2 * x * Chebyshev_T< N-1 >::calculate(x) - Chebyshev_T< N-2 >::calculate(x);
  }
};

// Base case: T_0(x) = 1
template <>
struct Chebyshev_T< 0 >
{
  __host__ __device__
  static float calculate(float x)
  {
    return 1;
  }
};

// Base case: T_1(x) = x
template <>
struct Chebyshev_T< 1 >
{
  __host__ __device__
  static float calculate(float x)
  {
    return x;
  }
};

Unfortunately, since this template is inlined many times, the function grows past the Olimit threshold.
kernel.cpp3.i.txt (1.91 MB)
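
One way to keep the inlined code small is sketched below. It is not taken from the attached project; it only illustrates the idea. Each level of the template above calls both N-1 and N-2, so the number of inlined calls grows roughly like the Fibonacci sequence before the optimizer collapses the duplicates. Carrying T_N and T_{N-1} together lets each level make a single recursive call. The names `Chebyshev_T_Linear`, `ChebyshevPair` and `CHEB_HD` are hypothetical, and whether this actually avoids the Olimit warning in nvopencc is an assumption that has not been verified.

```cpp
#include <cassert>
#include <cmath>

// Assumption: CHEB_HD is a helper macro introduced here so the same code
// builds with nvcc and with a plain host compiler.
#ifdef __CUDACC__
#define CHEB_HD __host__ __device__
#else
#define CHEB_HD
#endif

// Holds T_N(x) and T_{N-1}(x) so one recursive call per level is enough.
struct ChebyshevPair
{
  float cur;   // T_N(x)
  float prev;  // T_{N-1}(x)
};

// General case (N >= 2): build (T_N, T_{N-1}) from (T_{N-1}, T_{N-2}).
template <int N>
struct Chebyshev_T_Linear
{
  CHEB_HD
  static ChebyshevPair pair(float x)
  {
    const ChebyshevPair p = Chebyshev_T_Linear< N-1 >::pair(x);
    ChebyshevPair r;
    r.cur = 2 * x * p.cur - p.prev;
    r.prev = p.cur;
    return r;
  }

  CHEB_HD
  static float calculate(float x)
  {
    return pair(x).cur;
  }
};

// Base case: (T_1, T_0) = (x, 1)
template <>
struct Chebyshev_T_Linear< 1 >
{
  CHEB_HD
  static ChebyshevPair pair(float x)
  {
    ChebyshevPair r;
    r.cur = x;
    r.prev = 1.0f;
    return r;
  }

  CHEB_HD
  static float calculate(float x)
  {
    return x;
  }
};

// Base case: T_0(x) = 1 (no pair() needed, nothing recurses into it)
template <>
struct Chebyshev_T_Linear< 0 >
{
  CHEB_HD
  static float calculate(float)
  {
    return 1.0f;
  }
};

// Host-side sanity check against the closed form T_N(x) = cos(N * acos(x)).
inline void chebyshev_linear_self_check()
{
  const float x = 0.9f;
  const float expected = std::cos(16.0f * std::acos(x));
  assert(std::fabs(Chebyshev_T_Linear< 16 >::calculate(x) - expected) < 1e-3f);
}
```

The call site stays the same as before, e.g. `Chebyshev_T_Linear< 16 >::calculate(x)`, so swapping it in for `Chebyshev_T< 16 >` would only be a rename.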

Thanks for providing the file. I have filed a compiler bug and attached the file. Even if code is complex, or very voluminous (as is the case here), the compiler should not throw an internal compiler error, so this is definitely something we need to resolve. Thanks again for your help.

The compiler team suggests using the noinline attribute on some of the functions in this code as a workaround with potentially higher performance, since "--opencc-options -OPT:Olimit=0" turns off a lot of optimizations.
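
As a rough illustration of that suggestion, the sketch below wraps the polynomial evaluation in a function marked with CUDA's `__noinline__` qualifier, so the expanded template body is emitted once instead of at every call site. Which functions in the real project should be marked is not stated in this thread; the wrapper `chebyshev_t16` and the macros `CHEB_HD` and `CHEB_NOINLINE` are hypothetical names chosen for this example, and the GCC attribute is only a host-side stand-in so the snippet also builds without nvcc.

```cpp
#include <cassert>
#include <cmath>

// Assumption: helper macros introduced for this sketch only.
#ifdef __CUDACC__
#define CHEB_HD __host__ __device__
#define CHEB_NOINLINE __noinline__
#elif defined(__GNUC__)
#define CHEB_HD
#define CHEB_NOINLINE __attribute__((noinline))
#else
#define CHEB_HD
#define CHEB_NOINLINE
#endif

// Same recursive template as in the post above.
template <int N>
struct Chebyshev_T
{
  CHEB_HD
  static float calculate(float x)
  {
    return 2 * x * Chebyshev_T< N-1 >::calculate(x) - Chebyshev_T< N-2 >::calculate(x);
  }
};

template <>
struct Chebyshev_T< 0 >
{
  CHEB_HD
  static float calculate(float) { return 1; }
};

template <>
struct Chebyshev_T< 1 >
{
  CHEB_HD
  static float calculate(float x) { return x; }
};

// Hypothetical wrapper: the template still expands inside this function,
// but callers get a real call instead of another copy of the body.
CHEB_HD CHEB_NOINLINE
float chebyshev_t16(float x)
{
  return Chebyshev_T< 16 >::calculate(x);
}

// Host-side sanity check: T_16(0.5) = cos(16 * pi / 3) = -0.5.
inline void chebyshev_noinline_self_check()
{
  assert(std::fabs(chebyshev_t16(0.5f) + 0.5f) < 1e-5f);
}
```

The kernel would then call `chebyshev_t16(x)` wherever it previously instantiated `Chebyshev_T< 16 >::calculate(x)` directly. The trade-off is one function call per evaluation in exchange for a much smaller caller.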