In a previous topic, I noted with surprise that my PGI 10.8 install seemed to be using CUDA 2.3 by default even though I have 3.1 available:
> pgaccelinfo
CUDA Driver Version: 3010
Device Number: 0
Device Name: Tesla T10 Processor
Device Revision Number: 1.3
<snip>
and am using the latest driver:
> cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 256.44 Thu Jul 29 01:22:44 PDT 2010
GCC version: gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)
So I did some investigating and found that when I use -Mcuda=3.1,… I seem to get fastmath no matter what. For example, if I compile without 3.1, using -Mcuda=ptxinfo,keepgpu,keepbin,keepptx,maxregcount:64,nofma -Kieee with and without fastmath, I get timings like:
> grep Kernel Without31-*/cudafor-flxy-SPvDPorig.out
Without31-fastmath/cudafor-flxy-SPvDPorig.out: Kernel : 67.512 +/- 1.289
Without31-Nofastmath/cudafor-flxy-SPvDPorig.out: Kernel : 177.938 +/- 2.823
where the fastmath version is much faster. But when I add 3.1 to the list (-Mcuda=3.1,ptxinfo,keepgpu,keepbin,keepptx,maxregcount:64,nofma -Kieee), the two timings are nearly the same:
> grep Kernel With31-*/cudafor-flxy-SPvDPorig.out
With31-fastmath/cudafor-flxy-SPvDPorig.out: Kernel : 67.215 +/- 1.344
With31-Nofastmath/cudafor-flxy-SPvDPorig.out: Kernel : 72.521 +/- 1.173
Now, I know timings aren’t proof, but I also compared the results against the CPU code by counting the number of array elements that fail a criterion (difference from the CPU value). Without 3.1 in the -Mcuda list, I get:
Nofastmath: Num fail: 89 out of: 1782
fastmath: Num fail: 743 out of: 1782
With 3.1 in the -Mcuda list:
Nofastmath: Num fail: 743 out of: 1782
fastmath: Num fail: 743 out of: 1782
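A fail count like the ones above can be computed along the following lines. This is only a minimal Python sketch: the relative-difference criterion, the tolerance value, the sample data, and the name `count_fails` are all assumptions for illustration, not the actual test used to produce these numbers.

```python
def count_fails(cpu, gpu, rel_tol=1e-6):
    """Count elements of gpu whose relative difference from cpu exceeds rel_tol.

    Both the relative criterion and the default tolerance are assumed here;
    the real test may use something different.
    """
    if len(cpu) != len(gpu):
        raise ValueError("arrays must have the same length")
    fails = 0
    for c, g in zip(cpu, gpu):
        # Scale by |c|, falling back to an absolute difference when c is zero.
        scale = abs(c) if c != 0.0 else 1.0
        if abs(g - c) / scale > rel_tol:
            fails += 1
    return fails


# Made-up sample values, only to show the report format.
cpu = [1.0, 2.0, 3.0, 0.0]
gpu = [1.0, 2.0001, 3.0, 0.0]
print("Num fail:", count_fails(cpu, gpu), "out of:", len(cpu))
```

Comparing which indices fail, and not just how many, is what allows the "same differences in the same place" check.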
This suggests to me that -Mcuda=3.1 enables fastmath by default, since I get the same differences in the same places (not shown, but confirmed). Is this true? And if so, is there a “nofastmath” option for use with 3.1?
Thanks,
Matt