Do CUDA Runtime APIs launch kernels internally? Or are they fundamentally different from kernels?
A few of them may. I am reasonably certain that cudaMemcpy() uses a kernel to copy the data when both the source and the destination are in device memory.
Most CUDA API calls, however, manipulate host-side control structures: primarily the CUDA context, but also (indirectly) OS control structures, through calls to operating system APIs. Much of this is single-threaded CPU activity, not GPU work.
I believe cudaMemset() and cudaMemsetAsync() may also launch a kernel under some circumstances.
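If you want to check this on your own setup rather than take it on faith, here is a minimal sketch. It contains no user-written __global__ kernels, so any kernel that shows up in a profiler trace must have been launched by the runtime or driver on behalf of cudaMemset() or the device-to-device cudaMemcpy(). The helper name copy_and_memset_roundtrip, the CHECK macro, and the 1 MiB buffer size are my own choices for illustration. Whether a kernel actually appears is an implementation detail and may vary with driver version, GPU, and transfer size.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a readable message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: fills one device buffer with cudaMemset, copies it
// device-to-device with cudaMemcpy, then reads it back to verify the bytes.
// Returns true if every byte arrived as 0xAB.
bool copy_and_memset_roundtrip(size_t n) {
    std::vector<unsigned char> host(n, 0);
    unsigned char *src = nullptr;
    unsigned char *dst = nullptr;

    CHECK(cudaMalloc(&src, n));
    CHECK(cudaMalloc(&dst, n));

    // Candidate 1: may be implemented with an internal kernel.
    CHECK(cudaMemset(src, 0xAB, n));

    // Candidate 2: device-to-device copy, may also use an internal kernel.
    CHECK(cudaMemcpy(dst, src, n, cudaMemcpyDeviceToDevice));

    // Device-to-host copy, used here only to verify the result.
    CHECK(cudaMemcpy(host.data(), dst, n, cudaMemcpyDeviceToHost));

    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));

    for (unsigned char b : host) {
        if (b != 0xAB) return false;
    }
    return true;
}

int main() {
    bool ok = copy_and_memset_roundtrip(1 << 20);  // 1 MiB, arbitrary size
    std::printf("%s\n", ok ? "data verified" : "data mismatch");
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Build it with nvcc and run it under a profiler such as Nsight Systems (for example, nsys profile --stats=true ./a.out), then look at whether the kernel summary lists anything. Since the program defines no kernels of its own, whatever is listed there came from the API calls.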