CUDA Fortran slower?

Hi, All,
I’m new to CUDA Fortran. I’m testing it and comparing it with CUDA C. In my tests, for the same problem, CUDA Fortran seems to be about 2 times slower than the CUDA C version.
I don’t know what’s wrong with my CUDA Fortran code. Maybe I missed something when compiling the CUDA Fortran version. I used the following command: pgfortran -fast xxx.cuf

Has anyone else had the same problem?

Hi lukeStar,

I’d need a specific example to understand what’s different. But unless you’re using texture memory, CUDA C and CUDA Fortran should have approximately the same performance.

  • Mat

I’m testing the cyclic reduction method to solve the tridiagonal system.

The method is described in detail in these two papers:
http://www.jcohen.name/papers/Zhang_Fast_2009.pdf
http://graphics.cs.ucdavis.edu/publications/print_pub?pub_id=978

The CUDA C version of the code can be found at the following link:

http://cudpp.googlec…l/pcr_kernel.cu

Of course, I modified the original work to fit my problem, but the core is basically the same. I first wrote a C version and then translated it into Fortran. For 1.5 M nodes, my C version took about 68.6 s for 3000 iterations, while my Fortran version took 161 s.

The link should be

http://cudpp.googlecode.com/svn-history/r96/branches/tridiagonal/cudpp/src/kernel/pcr_kernel.cu

I checked my Fortran code. There is one difference from the C version: in the C version I declare 5 arrays (each of size n) in shared memory, but in the Fortran version I declare only one shared-memory array (of size 5*n).
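A simplified CUDA C sketch of the two layouts, for illustration only. These are not the actual kernels: `N`, the kernel names, the helper `layouts_agree`, and the placeholder arithmetic are all assumptions, and both layouts are shown in CUDA C just so they can be compared side by side.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256  // hypothetical per-block system size (assumption)

// Layout A: five separate shared arrays of size N.
__global__ void five_arrays(const float *in, float *out)
{
    __shared__ float a[N], b[N], c[N], d[N], x[N];
    int i = threadIdx.x;
    a[i] = in[i];
    b[i] = 2.0f * in[i];
    c[i] = 3.0f * in[i];
    d[i] = 4.0f * in[i];
    x[i] = 5.0f * in[i];
    __syncthreads();          // make every thread's writes visible to the block
    int j = (i + 1) % N;      // read a neighbour's slot so shared memory matters
    out[i] = a[j] + b[j] + c[j] + d[j] + x[j];
}

// Layout B: one shared array of size 5*N, sliced into five parts by offset.
__global__ void one_array(const float *in, float *out)
{
    __shared__ float s[5 * N];
    float *a = s, *b = s + N, *c = s + 2 * N, *d = s + 3 * N, *x = s + 4 * N;
    int i = threadIdx.x;
    a[i] = in[i];
    b[i] = 2.0f * in[i];
    c[i] = 3.0f * in[i];
    d[i] = 4.0f * in[i];
    x[i] = 5.0f * in[i];
    __syncthreads();
    int j = (i + 1) % N;
    out[i] = a[j] + b[j] + c[j] + d[j] + x[j];
}

// Runs both kernels on the same input and reports whether the outputs match.
bool layouts_agree()
{
    float h_in[N], h_a[N], h_b[N];
    for (int i = 0; i < N; ++i) h_in[i] = 0.5f * i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    five_arrays<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h_a, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    one_array<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h_b, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < N; ++i)
        if (h_a[i] != h_b[i]) return false;
    return true;
}

int main()
{
    printf("layouts agree: %s\n", layouts_agree() ? "yes" : "no");
    return 0;
}
```

Both kernels reserve the same 5*N floats of shared memory per block, so any speed difference between them would have to come from the generated index arithmetic rather than from the amount of shared memory used.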

Could this be the reason the Fortran version is slower?

Possible, but I doubt it. Can you post both your CUDA C and CUDA Fortran codes, or send them to PGI Customer Service (trs@pgroup.com) and ask them to forward them to me?

  • Mat

I already sent you the codes. Please test them and let me know your results!

Hi lukeStar,

By default, CUDA Fortran uses a slower but more precise divide, which appears to account for the difference. Compile with “-Mcuda=fastmath” to use the faster version.

My times:

30976 ms with CUDA C compiled w/ -O3
67696 ms with CUDA Fortran compiled w/ -fast
29309 ms with CUDA Fortran compiled w/ -fast -Mcuda=fastmath
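To make the precise-versus-fast divide trade-off concrete, here is a minimal CUDA C sketch, not taken from either of the codes discussed. It compares the ordinary `/` operator with `__fdividef`, CUDA's fast approximate single-precision divide intrinsic. The kernel name `divide_both`, the helper `max_rel_diff`, and the test data are made up for illustration, and it is only an assumption that the fast path selected by “-Mcuda=fastmath” behaves like this intrinsic.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Computes each quotient twice: with the standard operator and with the
// fast approximate intrinsic. Build without nvcc's -use_fast_math,
// otherwise '/' is also mapped to the fast divide.
__global__ void divide_both(const float *num, const float *den,
                            float *precise, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        precise[i] = num[i] / den[i];
        fast[i]    = __fdividef(num[i], den[i]);
    }
}

// Returns the largest relative difference between the two results
// over n made-up test values (hypothetical helper).
float max_rel_diff(int n)
{
    std::vector<float> h_num(n), h_den(n), h_precise(n), h_fast(n);
    for (int i = 0; i < n; ++i) {
        h_num[i] = 1.0f + 0.37f * i;   // arbitrary numerators
        h_den[i] = 3.0f + 0.11f * i;   // arbitrary non-zero denominators
    }

    size_t bytes = n * sizeof(float);
    float *d_num, *d_den, *d_precise, *d_fast;
    cudaMalloc(&d_num, bytes);
    cudaMalloc(&d_den, bytes);
    cudaMalloc(&d_precise, bytes);
    cudaMalloc(&d_fast, bytes);
    cudaMemcpy(d_num, h_num.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_den, h_den.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    divide_both<<<blocks, threads>>>(d_num, d_den, d_precise, d_fast, n);

    // These copies wait for the kernel to finish before reading results.
    cudaMemcpy(h_precise.data(), d_precise, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_fast.data(), d_fast, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_num);
    cudaFree(d_den);
    cudaFree(d_precise);
    cudaFree(d_fast);

    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float rel = fabsf(h_fast[i] - h_precise[i]) / fabsf(h_precise[i]);
        worst = fmaxf(worst, rel);
    }
    return worst;
}

int main()
{
    printf("max relative difference: %g\n", max_rel_diff(1024));
    return 0;
}
```

A single divide differs only in its last bits, but an iterative solver that runs thousands of divides per step can accumulate those differences, so the final results should be compared as well as the timings.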

Hope this helps,
Mat

Mat, thank you so much for your help!
I have more questions:

  1. When should we use the option “-Mcuda=fastmath”?
  2. Will this option affect the accuracy of the results? I compared the results from the two compile options. They do differ slightly (from the 4th digit after the decimal point).
  1. When should we use the option “-Mcuda=fastmath”?

When speed takes precedence over accuracy.

  2. Will this option affect the accuracy of the results?

Most likely, but to what extent will depend on the application.

I compared the results from the two compile options. They do differ slightly (from the 4th digit after the decimal point).

What is the difference with the CUDA C version? I suspect it will be closer to the fastmath version.

  • Mat