Problem with -fast option using PGI-11.6 and CUDA3.2 or 4.0

alfvenwave · July 4, 2011, 11:03am

Hi,

I’ve been banging my head up against a wall with this one. We have installed the latest PGI cuda-F compilers (11.5 and 11.6) and the latest cuda toolkits (3.2 and 4.0), all under modules so we can test one against another.

We have three machines, one with C1060s, one with GTX480s and one with GTX580s.

My code compiles fine under 11.6 and cuda 3.2 or cuda 4.0 (also fine with 11.5 and 3.2) on all the machines, all using the same OS.

My code runs fine on the C1060 machine with or without compilation using the -fast option.

However, on the GTX480 and GTX580 cards, if I compile with -fast, my code crashes with a typical cuda error:

0: copyout Memcpy (host=0x2cd32290, dev=0x202320000, size=131072) FAILED: 4(unspecified launch failure)
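
Kernel launches are asynchronous, so an "unspecified launch failure" reported at a copyout generally means an earlier kernel failed, not the copy itself. A minimal sketch of checking for errors right after each launch, which may help pin down the offending kernel. The module `check_mod` and the kernel `scale` are hypothetical stand-ins, not part of the code discussed in this thread, and `cudaThreadSynchronize` is assumed because it is the call available with the CUDA 3.2 toolkit (newer toolkits prefer `cudaDeviceSynchronize`):

```fortran
module check_mod
  use cudafor
  implicit none
contains
  ! Hypothetical kernel used only to illustrate the checking pattern.
  attributes(global) subroutine scale(a, n)
    integer, value :: n
    real :: a(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = 2.0 * a(i)
  end subroutine scale
end module check_mod

program check_launch
  use cudafor
  use check_mod
  implicit none
  integer, parameter :: n = 1024
  real :: a(n)
  real, device :: a_d(n)
  integer :: istat

  a = 1.0
  a_d = a                                   ! host-to-device copy
  call scale<<<(n + 255) / 256, 256>>>(a_d, n)

  ! Errors in the launch itself (bad configuration, etc.)
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) &
    print *, 'launch error: ', trim(cudaGetErrorString(istat))

  ! Block until the kernel finishes; execution errors surface here
  ! instead of at the next memory copy.
  istat = cudaThreadSynchronize()
  if (istat /= cudaSuccess) &
    print *, 'kernel error: ', trim(cudaGetErrorString(istat))

  a = a_d                                   ! device-to-host copy
  print *, a(1)
end program check_launch
```

Adding the two checks after each kernel call in turn should show which launch first returns a non-zero status.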

It runs fine however without the -fast option, and needless to say, it runs fine in the emulator. It also runs fine on the GTX480 and 580 using PGI10.9 and cuda3.1. Unfortunately, -fast gives me a 70% speedup (yeah, maybe I could try harder…). Having spent about 20 hours solid trying to identify if I have made a mistake somewhere, I am coming to the conclusion that it may be the optimization that’s got a bug in it.

My code is ~16K lines long - I could try stripping it down, and indeed have done this, however I always come up against a brick wall before it gets to a small amount of code. The code crashes when it tries to pass device data to a device function/subroutine.

Can anyone give me some advice as to how I might proceed?

Rob.

alfvenwave · July 4, 2011, 2:17pm

More info. -O2 also causes a crash. Looking at what is being optimized, it’s mainly reduction inlining of routines like minval, maxval, dot_product, but that happens without the -fast or -O2 options so that’s unlikely to be my problem…

With optimization turned on, the following extra optimizations are taking place:

Loop unrolling
Memory zero idiom, loop replaced by call to __c_mzero8
Memory set idiom, loop replaced by call to __c_mset8
Generated vector sse code for the loop
Memory zero idiom, array assignment replaced by call to pgf90_mzero4
Parallel region activated
Parallel loop activated with static block schedule + barrier terminated
Memory copy idiom, array assignment replaced by call to pgf90_mcopy8

It would be really useful to be able to turn off the optimization of different routines, or different sections of code. Can this be done with pragma statements? It was my understanding that in 11.6, device routines do not have to sit inside a module together with the kernel that calls them. Is this really the case? When I tried it, my code refused to compile.

Am I right b.t.w. that print *, … should work in device and global routines in 11.6? I can’t get my code to compile with this feature:

PGF90-S-0000-Internal compiler error. unsupported procedure 445 (curk4_mod.cuf: 1244)

Any help greatly appreciated…

Rob.

MatColgrove · July 6, 2011, 9:04pm

Hi Rob,

> Can anyone give me some advice as to how I might proceed?

Are your CUDA drivers up to date on the GTX systems?

How much memory does your program use? The C1060 has 4GB while the GTX cards only have 1.5GB. Could your memory usage be right at the edge where optimization uses just a bit more and pushes the program over the limit?
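
A quick way to see how close the program is to the limit is to query the device from the host. This is only a sketch, assuming the `cudafor` module in your release provides `cudaMemGetInfo`; the program name and variable names are illustrative:

```fortran
program gpu_mem
  use cudafor
  implicit none
  integer(kind=cuda_count_kind) :: free_bytes, total_bytes
  integer :: istat

  ! Ask the runtime how much device memory is free and how much exists.
  istat = cudaMemGetInfo(free_bytes, total_bytes)
  if (istat /= cudaSuccess) then
    print *, 'cudaMemGetInfo failed: ', trim(cudaGetErrorString(istat))
  else
    print *, 'free MB:  ', free_bytes / (1024 * 1024)
    print *, 'total MB: ', total_bytes / (1024 * 1024)
  end if
end program gpu_mem
```

Calling the same function inside the real application, after all device arrays have been allocated, would show how much headroom is left on the GTX cards compared with the C1060.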

You can send me the code, but I only have C1060s, C2070s, and some older GTX280s. So I’m not sure I’ll be able to reproduce the error. Maybe on the C2070 if it is a Tesla versus Fermi issue.

> Am I right b.t.w. that print *, … should work in device and global routines in 11.6? I can’t get my code to compile with this feature:

Yes, we just added support for the basic “print *,” from device code. Though, right now I don’t find it too useful since all the output from all threads can be intermixed, leading to garbage. We’re working on buffering the output better.

I’m guessing your error is caused by the compiler needing to convert a value into a string, though if you could post a snippet of the code and the data types of the variables that you are printing, that would be helpful.

Mat

alfvenwave · August 1, 2011, 11:25am

Hi Mat.

The reason why print *, ‘…’ doesn’t appear to work for me seems to be because I’m using openMP together with cuda-fortran (i.e. I have one GPU per openMP thread). If I compile without the -mp option, the compiler doesn’t complain.

Rob.

MatColgrove · August 1, 2011, 5:02pm

Thanks Rob. I was able to recreate the ICE and submitted TPR#18059. Looks like we’re adding some OpenMP barriers to protect the I/O. This is correct for host code, but obviously not for device code.

My guess is that this problem will be moot once the device I/O buffering is in place, but it’s something our engineers should be aware of.

Mat

tull · June 8, 2013, 1:43am

TPR 18059 - CUF: Using “print” in a CUDA kernel compiled with -mp gets “unsupported procedure” ICE
