Okay, here are my results:
With respect to CUDA Fortran, my application uses 60 registers when compiled for cc 2.0 and only 27 registers when compiled for cc 1.3/1.2, which is what Michael expected. If I run the executable compiled for cc 1.3, I get the same performance as the CUDA C run. Are there any further flags (besides fastmath) that I could specify for cc 2.0 to make it run faster?
I also just realized that our CUDA C version wasn't compiled for cc 2.0 either, but for cc 1.3, for the best-effort approach. If I compile it for cc 2.0 with -ftz=true -prec-sqrt=false -prec-div=false, the application still runs slower than the cc 1.3 build and takes approximately the same time as the CUDA Fortran version compiled for cc 2.0 with fastmath. Without these flags it is even slower than the CUDA Fortran version. The cc 2.0 version uses 60 registers without these flags and 36 registers with them. The original cc 1.3 version uses 27 registers.
So the results (runtimes and register counts) are comparable again between CUDA Fortran and CUDA C.
But I still find it unexpected to see such a large slowdown just from changing the compute capability from 1.3 to 2.0 on Fermi…
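
To put the register counts in perspective, here is a rough back-of-the-envelope sketch of how many warps could be resident on one Fermi multiprocessor if the register file were the only limit. The hardware numbers (32768 registers and at most 48 resident warps per multiprocessor for cc 2.0) are assumptions taken from the published compute capability tables, and the function name is purely illustrative. The sketch ignores register allocation granularity, block size and shared memory. Also, the 27-register figure is for the cc 1.3 build, so the count actually used on Fermi may differ. It only shows how far apart 60, 36 and 27 registers are, not that occupancy is the cause of the slowdown.

```python
# Rough upper bound on resident warps per multiprocessor from registers alone.
# Assumed limits for compute capability 2.0 (Fermi); not measured values.
REGS_PER_SM = 32768      # 32-bit registers per multiprocessor (assumed, cc 2.0)
MAX_WARPS_PER_SM = 48    # maximum resident warps per multiprocessor (assumed, cc 2.0)
WARP_SIZE = 32


def max_resident_warps(regs_per_thread):
    """Warps that fit in the register file, capped by the hardware warp limit.

    Ignores allocation granularity, block size and shared memory usage.
    """
    threads = REGS_PER_SM // regs_per_thread
    return min(threads // WARP_SIZE, MAX_WARPS_PER_SM)


if __name__ == "__main__":
    # Register counts reported in this post.
    builds = [
        ("cc 2.0, no fast-math flags", 60),
        ("cc 2.0, CUDA C with -ftz/-prec-sqrt/-prec-div flags", 36),
        ("cc 1.3 build", 27),
    ]
    for label, regs in builds:
        warps = max_resident_warps(regs)
        share = 100.0 * warps / MAX_WARPS_PER_SM
        print(f"{label}: {regs} regs/thread -> "
              f"at most {warps} of {MAX_WARPS_PER_SM} warps ({share:.0f}%)")
```

Under these assumptions, the register-only bound comes out at 17, 28 and 37 of 48 warps for 60, 36 and 27 registers per thread.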