hand-written kernel vs. CUDA library performance

Assuming you know exactly which GPU you are going to use, how would the performance of a hand-written CUDA program (using only the CUDA runtime/driver APIs) compare with that of one built on CUDA libraries (anything beyond the runtime/driver themselves, such as cuBLAS, Thrust…)?

You’re unlikely to be able to write your own matrix multiplication kernel that is as fast as cuBLAS gemm, unless your name is Scott Gray.

I would also advise folks never to write their own routines for things like sorting or prefix sums. Use Thrust or CUB instead. High-quality libraries offer so many advantages over code you are likely to write yourself that, in general, I think it’s a good idea to use library routines. Occasionally, because library routines are inflexible, you can write your own compound CUDA kernel that does several things at once and is faster than multiple library calls. But for programmer productivity and maintenance reasons, even this may not be a good idea.
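To make the "several things at once" point concrete, here is a minimal sketch, assuming Thrust as shipped with the CUDA Toolkit and compilation with nvcc. The names `Square`, `sum_of_squares_two_calls`, `sum_of_squares_fused` and `sort_then_prefix_sum` are made up for illustration. Note that the fused version below still uses a library primitive (`thrust::transform_reduce`) rather than a hand-written kernel; it only shows what avoiding an intermediate buffer looks like, and makes no claim about how large the speedup is on any particular GPU.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/transform_reduce.h>

// Hypothetical functor for this example: squares one element.
struct Square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

// Two separate library calls: the squares are written to a temporary
// buffer in device memory, then read back by the reduction.
float sum_of_squares_two_calls(const thrust::device_vector<float>& v) {
    thrust::device_vector<float> tmp(v.size());
    thrust::transform(v.begin(), v.end(), tmp.begin(), Square());
    return thrust::reduce(tmp.begin(), tmp.end(), 0.0f, thrust::plus<float>());
}

// One fused call: each element is squared as it is reduced,
// so no intermediate buffer is needed.
float sum_of_squares_fused(const thrust::device_vector<float>& v) {
    return thrust::transform_reduce(v.begin(), v.end(), Square(), 0.0f,
                                    thrust::plus<float>());
}

// Sorting and prefix sums straight from the library:
// sorts keys in place, then writes their inclusive prefix sums to sums.
void sort_then_prefix_sum(thrust::device_vector<int>& keys,
                          thrust::device_vector<int>& sums) {
    thrust::sort(keys.begin(), keys.end());
    sums.resize(keys.size());
    thrust::inclusive_scan(keys.begin(), keys.end(), sums.begin());
}
```

Both sum-of-squares functions should return the same value for the same input; the difference is only in how much memory traffic they generate. If a fused library primitive like this exists for your case, it is probably still a better choice than a custom kernel.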

For the most part, these statements are just a general recital of good software engineering practice (IMO), not anything specific to CUDA. I doubt you could write a CPU-based matrix multiply routine that is as fast as the one found in Intel MKL, for example.