I believe many of you use CUBLAS functions for iterative methods. Just for fun, I compared the CUBLAS saxpy with my own simple kernel, which follows:
The CUBLAS saxpy is more general than your version, including incx and incy options for array stride. That extra indexing arithmetic probably slows down the CUBLAS code relative to your code, which assumes consecutive elements.
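To make the stride point concrete, here is a minimal CPU-side sketch in plain C. It is not the CUBLAS source and not the kernel from the first post; the function names `saxpy_strided` and `saxpy_contiguous` are made up for illustration, and the strided version assumes positive `incx` and `incy` (reference BLAS also accepts negative strides). It only shows where the extra index multiplications come from.

```c
#include <stddef.h>

/* Hypothetical illustration of the general BLAS-style form:
 * y[i*incy] += alpha * x[i*incx].
 * Assumption: incx > 0 and incy > 0. */
void saxpy_strided(int n, float alpha,
                   const float *x, int incx,
                   float *y, int incy)
{
    for (int i = 0; i < n; ++i) {
        /* One extra multiply per operand to compute the strided index. */
        y[(size_t)i * (size_t)incy] += alpha * x[(size_t)i * (size_t)incx];
    }
}

/* Hypothetical illustration of the specialized form:
 * elements are consecutive, so the index is used directly. */
void saxpy_contiguous(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

With `incx = incy = 1` both functions compute the same result. On a GPU, the same difference would show up as extra integer arithmetic per thread in the general version, though how much it costs in practice would have to be measured.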
Wow, I take my hat off to you, Seibert. Your reply is much better than I expected. I didn't know the CUBLAS sources were available for download anywhere. I assumed NVIDIA kept them secret. Using your link, I can study the CUBLAS code carefully. THANK YOU!!!