NVCC reports: "Warning: Olimit was exceeded on function ..." What does it mean and how to override it?

NVCC also gives the following info:

Warning: Olimit was exceeded on function _Z14KerneliiiiiiiPfjiiii; will not perform function-scope optimization.
To still perform function-scope optimization, use -OPT:Olimit=0 (no limit) or -OPT:Olimit=58662
Warning: To override Olimit for all functions in file, use -OPT:Olimit=58662
(Compiler may run out of memory or run very slowly for large Olimit values)

Obviously, it is something about optimization. However, adding "-OPT:Olimit=58662" to the command line results in the following message:
nvcc fatal : ‘PT:Olimit=58662’: expected a number

nvcc_2.3.pdf also does not give any info on the subject. How is it possible to override the Olimit?

Thanks in advance.

Try these variations on the nvcc command line. I suspect Olimit is an option of the Open64 frontend (nvopencc):

--opencc-options -OPT:Olimit=58662

--opencc-options OPT:Olimit=58662

--opencc-options "-OPT:Olimit=58662"
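
If one of these works, a complete invocation would then look something like this (the filename and -arch flag are hypothetical placeholders):

nvcc -arch=sm_13 --opencc-options -OPT:Olimit=58662 -o kernel kernel.cu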

Argh … as always, the solution is simple (the first variation worked just fine). Thank you very much!

I observed one thing regarding this: initially my code used double precision (all kernel computations were in double) and I did not run into this issue.

Once I tried using single precision (floats instead of doubles) in my kernel, I ran into this issue. Does this imply that for double-precision arithmetic the compiler omits certain optimizations? I can see why the compiler might want to leave double-precision arithmetic untouched, but I just want to confirm this.

The “Olimit exceeded” messages from the Open64 frontend (which, as of CUDA 4.1, is used only for sm_1x compilation) are related to code size. Single-precision code can be larger than the equivalent double-precision code due to differences in inlining. For example, the approximate single-precision division, reciprocal, and square root are inlined, while the IEEE-rounded double-precision division, reciprocal, and square root are called as subroutines, due to their size.
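
A quick way to observe this (a minimal sketch; the kernel names are mine) is to compile a pair of toy kernels such as the ones below with nvcc -arch=sm_13 -ptx and compare the generated PTX: the single-precision division expands inline, while the double-precision division appears as a subroutine call.

__global__ void div_float(float *out, const float *a, const float *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a[i] / b[i];   // approximate single-precision division, inlined on sm_1x
}

__global__ void div_double(double *out, const double *a, const double *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a[i] / b[i];   // IEEE-rounded double-precision division, emitted as a subroutine call
}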

That does make sense.

I had defined the device functions in which I do __fmul_ru()/__dmul_ru(), __fdiv_ru(), etc. as ‘inline’. Are you implying that when the double-precision multiply etc. is used, this inlining is ignored?

Also, how can I get my code to compile in single-precision mode? Removing the ‘inline’ keyword didn't help.

I use interval arithmetic, for which I have to overload the +, -, and * operators.

Sid.

My remarks applied to floating-point divisions generated from the “/” operator, and the sqrtf() math function.

The GPU hardware has instructions for addition, multiplication, and multiply-add. For double precision, and for single precision on sm_2x, the hardware supports all four IEEE rounding modes. For single precision on sm_1x, the multiply-add is not rounded according to IEEE-754, and add and multiply only support the rounding modes round-to-zero and round-to-nearest, so __fadd_r{u|d} and __fmul_r{u|d} are implemented via inline emulation code that is quite large. It is possible to efficiently implement interval arithmetic with just the two rounding modes supported by the hardware for single precision on sm_1x. See:

http://perso.ens-lyon.fr/sylvain.collange/talks/collange_gpu_interval_paris08.pdf
http://hal.inria.fr/docs/00/26/36/70/PDF/interval_gpu_fpl_v2.pdf
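
For reference, the straightforward directed-rounding formulation of, say, an interval add looks like the sketch below (the struct and function names are mine). On sm_1x, the __fadd_rd()/__fadd_ru() intrinsics in such code are exactly the large inlined emulation sequences described above, which is what drives up the code size.

struct interval { float lo, hi; };

__device__ interval iadd(interval a, interval b)
{
    interval r;
    r.lo = __fadd_rd(a.lo, b.lo);   // lower bound, rounded toward -infinity
    r.hi = __fadd_ru(a.hi, b.hi);   // upper bound, rounded toward +infinity
    return r;
}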

The “Olimit exceeded” messages are just warnings advising the user that the generated code is so big that the compiler may not be able to fully optimize it unless you pass the suggested options. They should not prevent your code from compiling and running successfully. If you pass the options suggested in the warning, your code should compile fully optimized, but it may take the compiler a very long time to generate the binary (or it may run out of memory if you are not compiling on a machine with lots of system memory).

Which CUDA version are you using? If you haven’t done so yet, I would suggest switching to CUDA 4.1 to take advantage of the latest improvements in the compiler.

Thanks for your reply. The links are helpful.

My CUDA version is 4.1:
Cuda compilation tools, release 4.1, V0.2.1221
NVIDIA UNIX x86_64 Kernel Module 285.05.32

For the interval arithmetic part: I am actually using the code from the Interval Arithmetic example provided by the NVIDIA GPU Computing SDK.
It uses the functions __fadd_rd() and __fmul_rd() OR __dadd_rd() and __dmul_rd(). I do not use ‘/’ or ‘sqrt’.
Are the implementations of these functions (__xmul_rd() and __xadd_rd()) inlined for the single-precision versions but not for double?
I mean, when my code uses __dadd_rd() or __dmul_rd(), it does not run into issues of overshooting the Olimit.

Is there any way to force the compiler to compile the single-precision code in the same way it compiles the double-precision code? My code compiles and runs fine with double precision.

When overriding the Olimit, my system probably runs out of memory or exceeds the maximum execution time, and compilation fails.

Sorry if I am missing something obvious here.
Sid

As I stated, __dadd_rd() and __dmul_rd() map directly to hardware instructions. __fadd_rd() and __fmul_rd() map to hardware instructions on sm_2x and to inlined emulation code on sm_1x.

I would suggest first attempting compilation with Olimit=0. If that doesn’t work or takes too long for your purposes, simply compile without it. The worst that should happen is that the compiler will not be able to apply all possible optimizations to your single-precision code, but the generated binary should still run perfectly fine.
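
For example (the filename and -arch flag are hypothetical):

nvcc -arch=sm_13 --opencc-options -OPT:Olimit=0 -o kernel kernel.cu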

If Olimit=0 works for compiling the single-precision code, it will also work for compiling the double-precision code, so there will be no difference in the mechanics of compilation. Likewise, if you ultimately choose to compile without Olimit, both single-precision and double-precision code will compile with the same compiler settings; the only difference will be that the single-precision compilation will print the warning, which you can ignore.

Thanks, this explains it …