CUDA Fortran vs. CUDA C on Fermi

I had implemented several tuned versions of my program in CUDA C, and I have now done the same in CUDA Fortran. On an NVIDIA Tesla S1070 (cc 1.3) and an NVIDIA GeForce GT220 (cc 1.2), I get almost the same performance from C and Fortran (for both single and double precision). However, if I run both versions on a Fermi GPU (C2050), CUDA Fortran is suddenly slower (and the slowdown is even worse for single precision than for double precision).
Do you have an explanation why there is a performance difference only on Fermi?

BTW: For CUDA Fortran I use pgf90 11.1 and -Mcuda=fastmath,cuda3.2.

Bye, Sandra

Try compiling with -Minfo when you build the CUDA Fortran application. Look at the registers used for compute capability 1.3 and 2.0. For compute capability 2.0 (64-bit mode), pointers are 64 bits wide and take two GPU registers, whereas for compute capability 1.3 pointers are only 32 bits wide, since the max memory size is 4GB. The CUDA Fortran compiler may not be optimizing pointer usage as much as it should. Let us know what you get, we’d be really interested if there’s a big difference.

I know that if I use -Minfo=accel I can get that information for the PGI Accelerator programming model. But if I use -Minfo (with or without “accel”) for CUDA Fortran, I get no output. What am I missing?
Is there another -Minfo option for CUDA Fortran? (The compiler reference does not mention any other option.)
Cheers, Sandra

For CUDA Fortran, you can use the ptxinfo option for -Mcuda. So: -Mcuda=fastmath,cuda3.2,ptxinfo. This will return the lmem, smem, cmem, and registers used for each CC.


Okay, here are my results:

With respect to CUDA Fortran, my application uses 60 registers if compiled for cc 2.0 and only 27 registers if compiled for cc 1.3/1.2. Thus, this is what Michael expected. If I run my application using the executable compiled for cc 1.3, I get the same performance as for the CUDA C run. Are there any further flags (besides fastmath) that I could specify for cc 2.0 so that it runs faster?

I also just realized that our CUDA C version wasn’t compiled for cc 2.0 either, but for cc 1.3, as the best-effort approach. If I compile it for cc 2.0 with -ftz=true -prec-sqrt=false -prec-div=false, the application still runs slower than the one for cc 1.3 and takes approximately the same time as the CUDA Fortran version compiled for cc 2.0 using fastmath. Without these flags it is even slower than the CUDA Fortran version. The cc 2.0 version uses 60 registers without the flags mentioned above and 36 registers with them. The original version for cc 1.3 uses 27 registers.

Thus, the results/runtimes/registers are comparable again between CUDA Fortran and CUDA C.
But I think it is still unexpected to see such a large slowdown on Fermi just from changing the compute capability from 1.3 to 2.0…
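
For what it's worth, here is a rough back-of-the-envelope check of what the register counts above could mean for occupancy. It is only a sketch under stated assumptions: the per-multiprocessor limits (16384 registers and 1024 resident threads for cc 1.3, 32768 registers and 1536 resident threads for cc 2.0) are taken from the compute-capability tables in the CUDA programming guide, the helper names are made up for illustration, and register allocation granularity, block size, and shared memory are ignored, so real occupancy may be lower. Register pressure is also only one possible factor in the slowdown, not a confirmed cause.

```python
# Rough, register-limited estimate of resident threads per multiprocessor (SM).
# Hardware limits: CUDA programming guide, compute-capability tables.
# Register counts: the values reported in this thread (27, 36, 60).
# Assumption: allocation granularity, block size, and shared memory are
# ignored, so this is an upper bound rather than a measured occupancy.

SM_LIMITS = {
    "1.3": {"registers": 16384, "max_threads": 1024},
    "2.0": {"registers": 32768, "max_threads": 1536},
}


def resident_threads(cc, regs_per_thread):
    """Threads that fit on one SM when only the register file is limiting."""
    limits = SM_LIMITS[cc]
    by_registers = limits["registers"] // regs_per_thread
    return min(by_registers, limits["max_threads"])


def occupancy(cc, regs_per_thread):
    """Fraction of the SM's maximum resident threads that can be active."""
    return resident_threads(cc, regs_per_thread) / SM_LIMITS[cc]["max_threads"]


if __name__ == "__main__":
    for cc, regs in [("1.3", 27), ("2.0", 60), ("2.0", 36), ("2.0", 27)]:
        print(f"cc {cc}, {regs:2d} regs/thread: "
              f"{resident_threads(cc, regs):4d} threads/SM, "
              f"occupancy {occupancy(cc, regs):.0%}")
```

Under these assumptions, 60 registers per thread leaves room for roughly a third of the threads a cc 2.0 multiprocessor could hold, while 36 registers lands close to the ratio that 27 registers gives on cc 1.3. Since cc 2.0 caps a thread at 63 registers, it may also be worth checking the lmem figure in the ptxinfo output for register spills.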