On Matmul Kernels for CUDA Fortran

This question is probably best directed at Michael Wolfe, regarding his October 2008 HPCwire article “Compilers and More: Optimizing GPU Kernels” (available online at http://www.hpcwire.com/features/Compilers_and_More_Optimizing_GPU_Kernels.html?viewAll=y).

All the examples for matmul were done for CUDA C; would the same barrage of tests for CUDA Fortran yield the same results?

That is, are there any nuances that come from programming in Fortran (e.g. column-major vs. row-major ordering)?

I know the above is a very generic question, but my goal is to port some of these CUDA C kernels to CUDA Fortran and benchmark them.

Also, I wanted to ask whether there are more examples of optimizing matmul kernels in Fortran beyond the one presented in the PGroup CUDA Fortran programming guide and Insider. Spam me with links or replies, thanks!

All the examples for matmul were done for CUDA C; would the same barrage of tests for CUDA Fortran yield the same results?

Not having gone through the process, I can’t be sure, but I believe it should.

That is, are there any nuances that come from programming in Fortran (e.g. column-major vs. row-major ordering)?

Since Fortran is column-major, you do need to keep this in mind when optimizing for memory usage. See the section titled “Improving Warp Performance” in my PGInsider Monte Carlo article (http://www.pgroup.com/lit/articles/insider/v2n1a4.htm) for an example.
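To make the column-major point concrete, here is a minimal host-side sketch in plain C of how the two layouts turn an (i, j) index pair into a memory offset. It is not taken from either article; the function names (col_major_offset, row_major_offset, stride_first_index) are hypothetical and chosen only for illustration, and indices are 0-based for simplicity. The idea it illustrates is that neighbouring threads should generally step through the index that is contiguous in memory: the first index in Fortran, the last index in C.

```c
#include <assert.h>
#include <stddef.h>

/* Offset (in elements) of a(i, j) in a column-major array, as in Fortran,
   with `rows` rows. The first index varies fastest in memory.
   Indices are 0-based here for simplicity (an assumption of this sketch). */
size_t col_major_offset(size_t i, size_t j, size_t rows) {
    return i + j * rows;
}

/* Offset (in elements) of a[i][j] in a row-major array, as in C,
   with `cols` columns. The last index varies fastest in memory. */
size_t row_major_offset(size_t i, size_t j, size_t cols) {
    return i * cols + j;
}

/* Distance in elements between the locations touched by two neighbouring
   threads when the thread index is mapped to the FIRST array index,
   for an n x n matrix. */
size_t stride_first_index(int column_major, size_t n) {
    if (column_major) {
        return col_major_offset(1, 0, n) - col_major_offset(0, 0, n);
    }
    return row_major_offset(1, 0, n) - row_major_offset(0, 0, n);
}

/* Small self-check: mapping the thread index to the first array index
   gives unit stride in the Fortran layout and stride n in the C layout. */
void layout_self_check(void) {
    assert(stride_first_index(1, 1024) == 1);
    assert(stride_first_index(0, 1024) == 1024);
}
```

So when porting a CUDA C matmul kernel, an access pattern that gave unit stride across threads in C may become a stride-n pattern in Fortran unless the roles of the two indices are swapped along with the layout.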

Also, I wanted to ask whether there are more examples of optimizing matmul kernels in Fortran beyond the one presented in the PGroup CUDA Fortran programming guide and Insider. Spam me with links or replies, thanks!

While I’m working on more examples, I hadn’t planned on adding more matmul variations. Any students out there looking for a paper topic?

  • Mat