What exactly do you mean by “asynchronous”? A kernel launch returns control to the CPU right away and never waits for the GPU to finish executing the kernel (waiting would be pointless anyway, since your CPU program cannot see the kernel's results until it copies them back). Functions like cudaMemcpy, on the other hand, do wait for all previously launched kernels to complete before they copy and return.
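To make the launch/copy distinction concrete, here is a minimal sketch, assuming the CUDA runtime API, a single device, and the default stream. The kernel `fill` and the helper `run_async_demo` are hypothetical names made up for this illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: writes `value` into every element.
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Returns 0 if every element has the expected value, 1 on any error or mismatch.
int run_async_demo() {
    const int n = 1 << 20;
    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return 1;

    // The launch only queues the kernel; control comes back to the CPU
    // without waiting for the GPU to finish.
    fill<<<(n + 255) / 256, 256>>>(d, n, 7);

    // Launch-time errors (e.g. a bad configuration) show up here.
    if (cudaGetLastError() != cudaSuccess) { cudaFree(d); return 1; }

    // cudaMemcpy on the default stream waits for the kernel above to
    // complete, then copies, so no explicit synchronization is needed.
    std::vector<int> h(n, 0);
    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i)
        if (h[i] != 7) return 1;
    return 0;
}

int main() {
    int rc = run_async_demo();
    std::printf("%s\n", rc == 0 ? "results are complete after cudaMemcpy"
                                : "error or mismatch");
    return rc;
}
```

If you want the CPU to wait without copying anything, cudaDeviceSynchronize() blocks until all previously queued GPU work has finished.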
From the GPU's point of view, kernels launched into the same stream (including the default stream) are executed strictly one after the other, in the order you launched them. Kernels launched into different CUDA streams may be reordered or even run in parallel, but only on hardware that supports concurrent kernel execution (compute capability 2.0 and later); on older devices they still run one at a time.
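The in-order behaviour can be checked with a small sketch like the following, assuming both kernels go into the default stream. The kernels `init` and `twice` and the helper `run_order_demo` are hypothetical names for this example. The second kernel reads what the first one wrote, and there is no synchronization call between the two launches.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernels, for illustration only.
__global__ void init(int *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = i;          // first kernel: x[i] = i
}

__global__ void twice(int *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2;         // second kernel: depends on init's output
}

// Returns 0 if x[i] == 2 * i for all i, 1 on any error or mismatch.
int run_order_demo() {
    const int n = 1 << 16;
    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return 1;

    // Both launches return immediately on the CPU side. On the GPU,
    // twice does not start until init has finished, because both are
    // queued in the same (default) stream.
    init<<<(n + 255) / 256, 256>>>(d, n);
    twice<<<(n + 255) / 256, 256>>>(d, n);
    if (cudaGetLastError() != cudaSuccess) { cudaFree(d); return 1; }

    // The copy waits for both kernels before reading the data.
    std::vector<int> h(n, -1);
    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i)
        if (h[i] != 2 * i) return 1;
    return 0;
}

int main() {
    int rc = run_order_demo();
    std::printf("%s\n", rc == 0 ? "kernels ran in launch order"
                                : "error or mismatch");
    return rc;
}
```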
So I would describe it like this: kernel calls are asynchronous with respect to anything you do on the CPU, but synchronous relative to other GPU code in the same stream, while cudaMemcpy and similar functions synchronize the CPU with the GPU.