You don’t need streams for that. Kernel launches in the CUDA runtime API are asynchronous by default: when you launch a kernel, control returns immediately to the host thread that executed the launch, and that thread is free to do whatever it wants while the kernel runs.
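A minimal sketch of that pattern (the kernel and the host-side work here are hypothetical placeholders, not anything from your code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel, standing in for any long-running device work.
__global__ void scaleKernel(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

// Hypothetical host-side work to overlap with the kernel.
static double hostWork(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += i * 0.5;
    return sum;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // The launch returns immediately; the kernel runs on the device
    // while the host thread continues past this line.
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // This host work runs concurrently with the kernel's execution.
    double result = hostWork(n);

    // Block until all previously issued device work has finished.
    cudaDeviceSynchronize();

    printf("host result: %f\n", result);
    cudaFree(d_data);
    return 0;
}
```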
The barrier function in the runtime API is cudaDeviceSynchronize() (the older cudaThreadSynchronize() is deprecated), and you are correct that the plain cudaMemcpy() calls are blocking. If you need a non-blocking copy, you will need the async versions of the calls (cudaMemcpyAsync()), and that requires streams. But for overlapping host and device execution, nothing extra is needed.
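For completeness, a sketch of the async copy path, assuming page-locked (pinned) host memory, which cudaMemcpyAsync() needs in order to be truly non-blocking with respect to the host:

```cpp
#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    float *h_buf, *d_buf;

    // Pinned host allocation so the async copy does not fall back
    // to synchronous behavior.
    cudaMallocHost(&h_buf, n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns immediately; the copy is queued on the stream.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... host work here overlaps with the transfer ...

    // Wait for everything queued on this stream to complete.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```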