PTX coding runtime gains over CUDA coding

Dear All

What runtime speedup can be expected from coding directly in PTX rather than in CUDA C++ for complex kernels?

Thanks

Luis Gonçalves

Impossible to say for the general case. You may encounter either speedup or slowdown compared to plain C++ code. Inline assembly on any platform likely has higher creation and maintenance cost compared to C++, so I usually advise against dropping down to inline assembly unless faced with a situation where the desired functionality is difficult to express efficiently at the C/C++ level due to limitations in the abstract execution model of the HLL.
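To make "difficult to express efficiently at the C/C++ level" concrete, here is a minimal sketch of the kind of case I have in mind: carry propagation between machine words. C++ has no notion of a carry flag, while PTX exposes one through `add.cc` and `addc`. The helper and kernel names (`add64_via_carry`, `add_kernel`, `check_add64_via_carry`) are made up for this illustration, and the 64-bit add built from 32-bit halves is only a toy; the technique matters for wider multi-word arithmetic.

```cuda
#include <cstdio>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: 64-bit addition assembled from 32-bit halves.
// add.cc.u32 sets the carry flag, addc.u32 adds it into the high half.
__device__ uint64_t add64_via_carry(uint64_t a, uint64_t b)
{
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);
    uint32_t rlo, rhi;
    asm ("add.cc.u32 %0, %2, %4;\n\t"
         "addc.u32   %1, %3, %5;"
         : "=r"(rlo), "=r"(rhi)
         : "r"(alo), "r"(ahi), "r"(blo), "r"(bhi));
    return ((uint64_t)rhi << 32) | rlo;
}

__global__ void add_kernel(const uint64_t *a, const uint64_t *b,
                           uint64_t *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) r[i] = add64_via_carry(a[i], b[i]);
}

// Runs the kernel on a few inputs that force a carry out of the low half
// and compares against the host's native 64-bit addition.
bool check_add64_via_carry()
{
    std::vector<uint64_t> a = {0xFFFFFFFFull, 0x1FFFFFFFFull,
                               0xFFFFFFFFFFFFFFFFull, 42ull};
    std::vector<uint64_t> b = {1ull, 0xFFFFFFFFull, 1ull, 58ull};
    const int n = (int)a.size();
    const size_t bytes = n * sizeof(uint64_t);
    std::vector<uint64_t> r(n, 0);

    uint64_t *d = nullptr;   // one allocation holding a, b and r back to back
    if (cudaMalloc(&d, 3 * bytes) != cudaSuccess) return false;
    cudaMemcpy(d, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d + n, b.data(), bytes, cudaMemcpyHostToDevice);

    add_kernel<<<1, 32>>>(d, d + n, d + 2 * n, n);
    cudaError_t err = cudaMemcpy(r.data(), d + 2 * n, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (r[i] != a[i] + b[i]) return false;   // unsigned wrap-around is intended
    return true;
}

int main()
{
    printf("%s\n", check_add64_via_carry() ? "results match" : "mismatch or CUDA error");
    return 0;
}
```

Even here it is worth comparing against what the compiler emits for the plain C++ version before assuming the hand-written sequence is faster.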

For your particular use case, have you already taken into account all relevant profiler feedback and exploited all the techniques suggested by the Best Practices Guide to improve the C++ code?

I just want to know about practical cases. The theory I already know.

My comments above were based purely on practical experience; no theory of any kind was considered. If you want to rephrase the question as “What is the maximum speedup you have observed on kernels of any kind from using inline PTX instead of C++?”, then the answer would be “about 30%”.
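If you want to check a figure like this on your own code, a rough sketch of a timing harness follows. It times a plain C++ kernel and an inline-PTX variant of the same kernel with CUDA events. The two SAXPY kernels and the helper names (`saxpy_cpp`, `saxpy_ptx`, `time_kernel_ms`, `compare_saxpy`) are placeholders invented for this illustration, not code from any real project; a memory-bound kernel like this one will most likely show no meaningful difference, which is rather the point.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

typedef void (*saxpy_fn)(float, const float *, float *, int);

// Plain C++ version: the compiler is free to contract this into an FMA.
__global__ void saxpy_cpp(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Same computation with the FMA spelled out as inline PTX.
__global__ void saxpy_ptx(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        asm ("fma.rn.f32 %0, %1, %2, %3;"
             : "=f"(r) : "f"(a), "f"(x[i]), "f"(y[i]));
        y[i] = r;
    }
}

// Average time per launch in milliseconds, after one warm-up launch.
float time_kernel_ms(saxpy_fn k, float a, const float *x, float *y,
                     int n, int reps)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    k<<<blocks, threads>>>(a, x, y, n);          // warm-up, not timed
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        k<<<blocks, threads>>>(a, x, y, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

// Times both variants on the same buffers; returns false on a CUDA error.
bool compare_saxpy(int n, int reps, float *ms_cpp, float *ms_ptx)
{
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    float *x = nullptr, *y = nullptr;
    if (cudaMalloc(&x, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&y, n * sizeof(float)) != cudaSuccess) { cudaFree(x); return false; }
    cudaMemcpy(x, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    *ms_cpp = time_kernel_ms(saxpy_cpp, 0.5f, x, y, n, reps);
    *ms_ptx = time_kernel_ms(saxpy_ptx, 0.5f, x, y, n, reps);

    cudaError_t err = cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return err == cudaSuccess;
}

int main()
{
    float ms_cpp = 0.0f, ms_ptx = 0.0f;
    if (!compare_saxpy(1 << 22, 100, &ms_cpp, &ms_ptx)) {
        printf("CUDA error\n");
        return 1;
    }
    printf("C++: %.4f ms  PTX: %.4f ms  ratio: %.2f\n",
           ms_cpp, ms_ptx, ms_cpp / ms_ptx);
    return 0;
}
```

Event timing only gives wall-clock numbers per launch; the profiler remains the better tool for finding out why one variant is faster or slower.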

A PTX assembly language coder of average skill is unlikely to best the compiler on any non-trivial kernel, unless it involves functionality that is poorly expressible in C++. The CUDA compiler uses a derivative of LLVM in the frontend, and from looking at generated code it is clear that LLVM incorporates some very sophisticated optimizations, high-level as well as low-level.

Thanks. That is the kind of answer I was looking for.