Optimization disablement using pragmas.

Hi.

I have a piece of code which behaves differently if I turn optimization (-fast) on (in fact, it crashes). I would like to be able to selectively switch off optimization for individual bits of code in order to find the offending code.

I can disable loop unrolling and vectorization using:

!pgi$r nounroll
!pgi$r novector

However there are still a bunch of optimizations I’d like to disable (I am not quite sure what some of them do):

1. FUNCTION reduction inlined
2. Memory zero idiom, loop replaced by call to __c_mzero8
3. Memory set idiom, loop replaced by call to __c_mset8
4. Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
5. Copy in and copy out of X in call to Y

Can anyone tell me which directives prevent these optimizations, so that I can effectively remove “all optimization” on a routine-by-routine basis?

Thanks,

Rob.

Hi Rob,

The “!pgi$r opt” directive lets you select the optimization level on a per-routine basis. The optimizations you list are -O1 and -O2 optimizations, so you can add “!pgi$r opt 0” before the routine to disable them.
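
For illustration, here is a minimal sketch of the placement. The module and routine names are made up, and the directive is written exactly as above on the assumption that it applies to the routine that follows it. Compilers that do not recognize the directive should treat it as an ordinary comment.

```fortran
! Sketch only: module and routine names are hypothetical, not from the thread.
module opt_demo_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

! Routine-scope directive as quoted above; assumed to apply to the next routine.
!pgi$r opt 0
  subroutine zero_fill(a, n)
    integer, intent(in)   :: n
    real(dp), intent(out) :: a(n)
    integer :: i
    do i = 1, n
      a(i) = 0.0_dp   ! a zeroing loop of the kind -Minfo reports as a memory zero idiom
    end do
  end subroutine zero_fill

  ! No directive here, so this routine keeps the command-line optimization level.
  subroutine scale_by(a, n, s)
    integer, intent(in)     :: n
    real(dp), intent(in)    :: s
    real(dp), intent(inout) :: a(n)
    a = a * s
  end subroutine scale_by

end module opt_demo_mod

program opt_demo
  use opt_demo_mod
  implicit none
  real(dp) :: a(4)
  call zero_fill(a, 4)        ! a is now all zeros
  a = a + 1.0_dp              ! a is now all ones
  call scale_by(a, 4, 2.0_dp) ! a is now all twos
  print *, a
end program opt_demo
```

Comparing the -Minfo messages for the two routines with and without the directive is one way to check that it has been picked up.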

Though, these optimizations are pretty safe, so I doubt they are the cause. Have you run your program through Valgrind (http://www.valgrind.org) and/or added diagnostic flags like -Mbounds? Not that there can’t be a compiler optimization bug, but it is more likely that there is an error in the program, such as a UMR (uninitialized memory read) or an out-of-bounds access, that only causes problems when optimization is applied.
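
To make the UMR case concrete, here is a deliberately broken sketch (a hypothetical program, not Rob’s code). The variable total is read before it is ever assigned, so the printed value depends on whatever happens to be in that memory location or register, and that can change with the optimization level. -Mbounds checks array subscripts, so it would not be expected to flag this; it is the kind of error Valgrind is designed to report in host code.

```fortran
! Deliberately buggy sketch: demonstrates an uninitialized memory read (UMR).
program umr_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: total      ! BUG (intentional): never initialized
  real(dp) :: x(4)
  integer :: i

  x = 1.0_dp
  do i = 1, 4
    total = total + x(i) ! first iteration reads the undefined value of total
  end do

  print *, total         ! result is undefined; it may differ between -O0 and -fast
end program umr_demo
```

Adding total = 0.0_dp before the loop removes the bug.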

  • Mat

Hi Mat,

Yes, I use -Mbounds; no, it hasn’t been through Valgrind yet. Does Valgrind work on CUDA kernels? I suspect it is my code that’s to blame; it’s just that the module in which the problem lies is now enormous… I’ll keep you updated.

Rob.

does valgrind work on CUDA kernels?

No, only host code.

  • Mat

Not sure what’s going on here. All I have done now is to add !pgi$r opt 0 in front of every routine in a module. I haven’t turned on optimization yet, so I would have thought that doing this should change nothing and just give a straight compile of an unoptimized module. That’s not what happens though. I get this:

bdry_find:
   8336, maxval reduction inlined
         minval reduction inlined
/tmp/pgnvdbTPed2dRjJjL.nv4(0): Warning: Olimit was exceeded on function curk4_kernel; will not perform function-scope optimization.
	To still perform function-scope optimization, use -OPT:Olimit=0 (no limit) or -OPT:Olimit=128634

then a huge scroll of:

/tmp/pgcudaforb7Ped_1oTOrn.gpu(128): Warning: Cannot tell what pointer points to, assuming global memory space

I am compiling with:

pgfortran -c foo_mod.cuf -Mpreprocess -lcurand -lcuda 
-Mcuda=fastmath,ptxinfo -Mbounds -mp -Mcuda=4.0 -Minfo 
-Mneginfo -Mcuda=keepbin -Mcuda=keepptx (+ some user defined pre-processor directives)

Any idea what might be going on - why it’s complaining about some optimization when it should all be disabled?

Rob.

By disabling optimization for my kernel subroutine, but leaving it on for all device-level subroutines/functions, I can now make my code run with most things optimized and no crash on the GTX480 and GTX580; the code doesn’t crash at all on a C1060. Comparing the two compilations, only one small part of the optimization appears to be causing the crash. This is the diff of the two compilations with optimization vs. without (i.e. diff):

304c304
< ptxas info    : Used 124 registers, 944+0 bytes lmem, 56+16 bytes smem, 560 bytes cmem[0], 224 bytes cmem[1], 8 bytes cmem[14]
---
> ptxas info    : Used 124 registers, 880+0 bytes lmem, 56+16 bytes smem, 560 bytes cmem[0], 248 bytes cmem[1], 8 bytes cmem[14]
323c323
<     1264 bytes stack frame, 4148 bytes spill stores, 3344 bytes spill loads
---
>     1168 bytes stack frame, 3736 bytes spill stores, 2988 bytes spill loads

Is it obvious from this what is causing the GTX480 and GTX580 to fail but the C1060 to succeed? Is the optimization causing the code to use too many registers or something? The kernel “with” optimization uses 64 bytes more lmem, 24 bytes less cmem[1], 96 bytes more stack frame, 412 bytes more spill stores and 356 bytes more spill loads. There is no further diagnostic information to tell me what exactly is being optimized.

Rob.

Hi Rob,

Unfortunately, this doesn’t tell us much. We’ll need the code in order to figure out what’s wrong.

Note that I’ll be on vacation till September 6th, so if you can send in a report to PGI Customer Service (trs@pgroup.com), I would appreciate it.

  • Mat