CUDA Fortran vs. CUDA C on Fermi

sWienke · April 13, 2011, 7:01am

Hi,
I have implemented several tuned versions of my program using CUDA C. Now I have done the same using CUDA Fortran. On an NVIDIA Tesla S1070 (cc 1.3) and an NVIDIA GeForce GT220 (cc 1.2), I get almost the same performance from C and Fortran (for both single and double precision). However, if I run both versions on a Fermi GPU (C2050), CUDA Fortran is suddenly slower (and the gap is even worse for single precision than for double precision).
Do you have an explanation why there is a performance difference only on Fermi?

BTW: For CUDA Fortran I use pgf90 11.1 and -Mcuda=fastmath,cuda3.2.

Bye, Sandra

mwolfe · April 13, 2011, 10:24pm

Try compiling with -Minfo when you build the CUDA Fortran application. Look at the registers used for compute capability 1.3 and 2.0. For compute capability 2.0 (64-bit mode), pointers are 64 bits and take two GPU registers, whereas for compute capability 1.3 pointers are only 32 bits, since the max memory size is 4GB. The CUDA Fortran compiler may not be optimizing pointer usage as much as it should. Let us know what you get; we’d be really interested if there’s a big difference.

sWienke · April 14, 2011, 3:32pm

I know that with -Minfo=accel I can get such information for the PGI Accelerator programming model. But if I use -Minfo (with or without “accel”) for CUDA Fortran, I get no output. What am I missing?
Is there another -Minfo option for CUDA Fortran? (The compiler reference does not mention any other option.)
Cheers, Sandra

TheMatt · April 14, 2011, 4:30pm

For CUDA Fortran, you can use the ptxinfo option for -Mcuda. So: -Mcuda=fastmath,cuda3.2,ptxinfo. This will return the lmem, smem, cmem, and registers used for each CC.

Matt

sWienke · April 15, 2011, 2:34pm

Okay, here are my results:

With respect to CUDA Fortran, my application uses 60 registers if compiled for cc 2.0 and only 27 registers if compiled for cc 1.3/1.2. Thus, this is what Michael expected. If I run my application using the executable compiled for cc 1.3, I get the same performance as for the CUDA C run. Are there any further flags (besides fastmath) that I could specify for cc 2.0 so that it runs faster?

I also just realized that our CUDA C version wasn’t compiled for cc 2.0 either, but for cc 1.3 (as a best-effort approach). If I compile it for cc 2.0 with -ftz=true -prec-sqrt=false -prec-div=false, the application still runs slower than the cc 1.3 build and takes approximately the same time as the CUDA Fortran version compiled for cc 2.0 using fastmath. Without these flags it is even slower than the CUDA Fortran version. The cc 2.0 build uses 60 registers without these flags and 36 registers with them. The original cc 1.3 build uses 27 registers.

Thus, the results/runtimes/registers are comparable again between CUDA Fortran and CUDA C.
But I think it is still unexpected to see such a large slowdown when changing the compute capability from 1.3 to 2.0 on Fermi…
