Using cuBLAS doesn't mean you have to copy the results back to the host before calling the next kernel. Just make sure the next kernel knows where the results of the cuBLAS calculation are, i.e. pass it the device address(es) of the result(s).
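To illustrate the point above, here is a minimal sketch in which the output of a cuBLAS call stays on the device and is handed straight to a custom kernel. It assumes the handle-based API from `cublas_v2.h` (the thread may well predate it), and the `addBias` kernel, the matrix size and the fill values are made up for the example. Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical follow-up kernel: adds a constant to every element in place.
__global__ void addBias(float *c, int count, float bias)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) c[i] += bias;
}

int main()
{
    const int n = 4;              // square n x n matrices (illustrative size)
    const int count = n * n;
    const size_t bytes = count * sizeof(float);

    std::vector<float> hA(count, 1.0f), hB(count, 2.0f), hC(count, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = A * B, computed by cuBLAS; the result is written to dC on the device.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    // No copy back to the host here: the next kernel just receives dC.
    // Both the cuBLAS call and this launch use the default stream,
    // so they execute in order.
    const int threads = 256;
    const int blocks = (count + threads - 1) / threads;
    addBias<<<blocks, threads>>>(dC, count, 0.5f);

    // Only the final result is copied back.
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

Build with something like `nvcc example.cu -lcublas`. The only thing the second kernel needs from cuBLAS is the device pointer `dC`.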
I just read elsewhere on this forum that the above statement is not entirely true. It depends on which cuBLAS routine you want to use!
You are right, but most of the time it is better to have two separate kernels, each with its own optimal launch grid. If you try to do everything in one kernel, you have to compromise on the implementation of both parts, and that cost will outweigh what you save by avoiding an extra read from global memory.
Even in cuBLAS, we sometimes split a BLAS routine into multiple kernels for that reason.
Moreover, on Fermi you can also take advantage of concurrent kernel execution using streams.
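As a rough sketch of the streams idea, the example below launches two independent kernels into two different streams so the hardware is allowed to run them concurrently. The `addValue` kernel, the buffer size and the values are hypothetical, and whether the kernels actually overlap depends on the device and on how many resources each launch uses. Error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds a constant to every element in place.
__global__ void addValue(float *x, int n, float v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += v;
}

int main()
{
    const int n = 1 << 10;
    const size_t bytes = n * sizeof(float);

    float *dX, *dY;
    cudaMalloc(&dX, bytes);
    cudaMalloc(&dY, bytes);
    cudaMemset(dX, 0, bytes);   // all-zero bytes are 0.0f
    cudaMemset(dY, 0, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // The two launches touch different buffers and sit in different streams,
    // so the device may execute them at the same time.
    addValue<<<blocks, threads, 0, s0>>>(dX, n, 1.0f);
    addValue<<<blocks, threads, 0, s1>>>(dY, n, 2.0f);

    // Wait for both streams before reading the results.
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    float x0 = 0.0f, y0 = 0.0f;
    cudaMemcpy(&x0, dX, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&y0, dY, sizeof(float), cudaMemcpyDeviceToHost);
    printf("x[0] = %f, y[0] = %f\n", x0, y0);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(dX);
    cudaFree(dY);
    return 0;
}
```

With the handle-based cuBLAS API, `cublasSetStream` can be used to place cuBLAS calls into a chosen stream in the same way. Concurrency mostly helps when each kernel is too small to fill the GPU on its own.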