I’ve been banging my head against a wall with this one. We have installed the latest PGI CUDA Fortran compilers (11.5 and 11.6) and the latest CUDA toolkits (3.2 and 4.0), all under modules so we can test one against another.
We have three machines, one with C1060s, one with GTX480s and one with GTX580s.
My code compiles fine under 11.6 and cuda 3.2 or cuda 4.0 (also fine with 11.5 and 3.2) on all the machines, all using the same OS.
My code runs fine on the C1060 machine with or without compilation using the -fast option.
However, on the GTX480 and GTX580 cards, if I compile with -fast, my code crashes with a typical CUDA error:
0: copyout Memcpy (host=0x2cd32290, dev=0x202320000, size=131072) FAILED: 4(unspecified launch failure)
It runs fine without the -fast option, however, and needless to say, it runs fine in the emulator. It also runs fine on the GTX480 and GTX580 using PGI 10.9 and CUDA 3.1. Unfortunately, -fast gives me a 70% speedup (yeah, maybe I could try harder…). Having spent about 20 hours solid trying to work out whether I have made a mistake somewhere, I am coming to the conclusion that the optimization itself may have a bug in it.
My code is ~16K lines long. I could try stripping it down, and indeed have tried, but I always hit a brick wall before it gets down to a small amount of code. The code crashes when it tries to pass device data to a device function/subroutine.
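To make the failing pattern concrete, here is a minimal sketch of a kernel passing device data to a device subroutine, with the error status checked right after the launch. It is not the actual code: the module, kernel and variable names are all hypothetical. Kernel launches are asynchronous, so an "unspecified launch failure" reported at a copyout normally comes from an earlier kernel; checking after each launch shows which one.

```fortran
module kernels_m            ! hypothetical module name
  use cudafor
  implicit none
contains
  ! Device subroutine that receives device data from the kernel.
  attributes(device) subroutine scale_elem(a, n, i)
    integer, value :: n, i
    real :: a(n)
    a(i) = 2.0 * a(i)
  end subroutine scale_elem

  ! Kernel: one thread per element, passes the device array on.
  attributes(global) subroutine run(a, n)
    integer, value :: n
    real :: a(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) call scale_elem(a, n, i)
  end subroutine run
end module kernels_m

program check_launch
  use cudafor
  use kernels_m
  implicit none
  integer, parameter :: n = 1024
  real :: a(n)
  real, device :: a_d(n)
  integer :: istat

  a = 1.0
  a_d = a                                   ! copyin

  call run<<<(n + 255) / 256, 256>>>(a_d, n)

  ! Errors in the launch itself (bad configuration, etc.)
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) print *, 'launch: ', cudaGetErrorString(istat)

  ! Wait for the kernel so a failure inside it is reported here,
  ! not at the next copy. cudaThreadSynchronize is the CUDA 3.x name.
  istat = cudaThreadSynchronize()
  if (istat /= cudaSuccess) print *, 'kernel: ', cudaGetErrorString(istat)

  a = a_d                                   ! copyout
  print *, a(1)
end program check_launch
```

Assuming the file is saved with a .cuf extension (or compiled with -Mcuda), building it once with and once without -fast and adding the same two checks after each kernel launch in the real code should at least identify the kernel that actually fails.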
Can anyone give me some advice as to how I might proceed?