Problem with -fast option using PGI-11.6 and CUDA3.2 or 4.0


I’ve been banging my head against a wall with this one. We have installed the latest PGI CUDA Fortran compilers (11.5 and 11.6) and the latest CUDA toolkits (3.2 and 4.0), all under modules so we can test one against another.

We have three machines, one with C1060s, one with GTX480s and one with GTX580s.

My code compiles fine under 11.6 and cuda 3.2 or cuda 4.0 (also fine with 11.5 and 3.2) on all the machines, all using the same OS.

My code runs fine on the C1060 machine with or without compilation using the -fast option.

However, on the GTX480 and GTX580 cards, if I compile with -fast, my code crashes with a typical cuda error:

0: copyout Memcpy (host=0x2cd32290, dev=0x202320000, size=131072) FAILED: 4(unspecified launch failure)

It runs fine, however, without the -fast option, and needless to say it runs fine in the emulator. It also runs fine on the GTX480 and GTX580 using PGI 10.9 and CUDA 3.1. Unfortunately, -fast gives me a 70% speedup (yeah, maybe I could try harder…). Having spent about 20 solid hours trying to identify whether I have made a mistake somewhere, I am coming to the conclusion that the optimization may have a bug in it.

My code is ~16K lines long. I could try stripping it down, and indeed have done so, but I always hit a brick wall before it gets down to a small amount of code. The crash occurs when the code tries to pass device data to a device function/subroutine.
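For reference, the failing pattern boils down to something like the sketch below: a global kernel handing device data to an attributes(device) subroutine in the same module. All names and the arithmetic are illustrative, not the actual code.

```fortran
module crash_pattern_mod
  use cudafor
  implicit none
contains

  ! Device subroutine that receives device data from the kernel
  attributes(device) subroutine scale_vec(v, n, s)
    real(8), intent(inout) :: v(*)
    integer, value :: n
    real(8), value :: s
    integer :: i
    i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    if (i <= n) v(i) = v(i) * s
  end subroutine scale_vec

  ! The crash appears only in the -fast/-O2 build, at the call site
  ! where device data is passed into the device subroutine
  attributes(global) subroutine step_kernel(v, n)
    real(8), device :: v(*)
    integer, value :: n
    call scale_vec(v, n, 0.5d0)
  end subroutine step_kernel

end module crash_pattern_mod
```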

Can anyone give me some advice as to how I might proceed?


More info: -O2 also causes the crash. Looking at what is being optimized, it’s mainly reduction inlining of routines like minval, maxval and dot_product, but that happens even without the -fast or -O2 options, so it’s unlikely to be my problem…

With optimization turned on, the following extra optimizations are taking place:

  1. Loop unrolling
  2. Memory zero idiom, loop replaced by call to __c_mzero8
  3. Memory set idiom, loop replaced by call to __c_mset8
  4. Generated vector sse code for the loop
  5. Memory zero idiom, array assignment replaced by call to pgf90_mzero4
  6. Parallel region activated
  7. Parallel loop activated with static block schedule + barrier terminated
  8. Memory copy idiom, array assignment replaced by call to pgf90_mcopy8

It would be really useful to be able to turn off optimization for individual routines or sections of code. Can this be done with pragma/directive statements? Also, it was my understanding that in 11.6, device routines no longer have to sit inside the same module as the kernel that calls them - is this really the case? When I tried it, my code refused to compile.
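Short of a per-routine directive, one way to narrow this down is per-file optimization: compile the suspect file without optimization and the rest with -fast, then bisect. The flags below are the ones already mentioned in this thread plus -Mcuda=cc20 to target the Fermi cards; the file names are illustrative.

```shell
# Suspect file: no optimization
pgfortran -O0 -Mcuda=cc20 -c curk4_mod.cuf

# Everything else: full optimization, with -Minfo to list what gets optimized
pgfortran -fast -Mcuda=cc20 -Minfo=all -c main.cuf

pgfortran -Mcuda=cc20 curk4_mod.o main.o -o mycode
```

If the crash disappears, keep moving routines between the two optimization levels until the offending one is isolated.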

Am I right b.t.w. that print *, … should work in device and global routines in 11.6? I can’t get my code to compile with this feature:

PGF90-S-0000-Internal compiler error. unsupported procedure 445 (curk4_mod.cuf: 1244)

Any help greatly appreciated…


Hi Rob,

Can anyone give me some advice as to how I might proceed?

Are your CUDA drivers up to date on the GTX systems?

How much memory does your program use? The C1060 has 4GB while the GTX cards only have 1.5GB. Could your memory usage be right at the edge where optimization uses just a bit more and pushes the program over the limit?
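One way to check this is to query free/total device memory at runtime right before the big allocations. A minimal sketch using the cudaMemGetInfo interface from the cudafor module (the routine name here is illustrative):

```fortran
subroutine report_gpu_memory
  use cudafor
  implicit none
  integer(kind=cuda_count_kind) :: free_mem, total_mem
  integer :: istat

  ! Query free and total memory on the current device
  istat = cudaMemGetInfo(free_mem, total_mem)
  if (istat /= cudaSuccess) then
    print *, 'cudaMemGetInfo failed: ', cudaGetErrorString(istat)
  else
    print *, 'GPU memory free/total (MB): ', &
             free_mem / 2**20, total_mem / 2**20
  end if
end subroutine report_gpu_memory
```

If the free figure on the GTX cards is within a few percent of your working-set size, the optimized build could indeed be pushing you over the edge.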

You can send me the code, but I only have C1060s, C2070s, and some older GTX280s. So I’m not sure I’ll be able to reproduce the error. Maybe on the C2070 if it is a Tesla versus Fermi issue.

Am I right b.t.w. that print *, … should work in device and global routines in 11.6? I can’t get my code to compile with this feature:

Yes, we just added support for the basic “print *,” from device code. Right now, though, I don’t find it too useful, since the output from all the threads can be intermixed, leading to garbage. We’re working on buffering the output better.
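For what it’s worth, usage looks like the fragment below (names illustrative; the kernel must live in a module). Note that the ordering of lines from different threads is undefined, which is the intermixing problem mentioned above.

```fortran
! Device-side print: each thread emits its own line, in no guaranteed order
attributes(global) subroutine debug_kernel(a, n)
  real(8), device :: a(*)
  integer, value :: n
  integer :: i
  i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
  if (i <= n) then
    print *, 'thread ', i, ' a = ', a(i)
  end if
end subroutine debug_kernel
```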

I’m guessing your error is caused by the compiler needing to convert a value into a string. If you could post a snippet of the code and the data types of the variables you are printing, that would be helpful.

  • Mat

Hi Mat.

The reason print *, ‘…’ doesn’t appear to work for me seems to be that I’m using OpenMP together with CUDA Fortran (i.e. I have one GPU per OpenMP thread). If I compile without the -mp option, the compiler doesn’t complain.
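The one-GPU-per-OpenMP-thread setup is roughly the following sketch (simplified and with illustrative names, not my actual code); the ICE appears as soon as a device routine containing print is compiled with -mp.

```fortran
program multi_gpu
  use cudafor
  use omp_lib
  implicit none
  integer :: tid, istat, ngpus

  istat = cudaGetDeviceCount(ngpus)

  !$omp parallel private(tid, istat) num_threads(ngpus)
  tid = omp_get_thread_num()
  ! Bind one GPU to each OpenMP thread
  istat = cudaSetDevice(tid)
  ! ... each thread allocates device data and launches
  !     kernels on its own device from here on ...
  !$omp end parallel
end program multi_gpu
```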


Thanks Rob. I was able to recreate the ICE and submitted TPR#18059. It looks like we’re adding some OpenMP barriers to protect the I/O. This is correct for host code, but obviously not for device code.

My guess is that this problem will be moot once the device I/O buffering is in place, but it’s something our engineers should be aware of.

  • Mat

TPR 18059 - CUF: Using “print” in a CUDA kernel compiled with -mp gets “unsupported procedure” ICE