How to reduce launch time with multiple streams

Hi Experts,
I use 64 streams, each running H2D copy, kernel, H2D copy. This gives good concurrency between kernels and copies across the streams, but launching the work on all 64 streams takes too long.
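A minimal sketch of the launch pattern described above, not the actual code. The kernel `scale`, the buffer names, and the per-stream size are hypothetical placeholders; only the 64 streams and the H2D, kernel, H2D order come from the post. The timer covers only the host-side enqueue loop, which is the cost in question.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel standing in for the real workload.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 64;        // stream count from the post
    const int kN = 1 << 18;         // elements per stream (assumed)
    const size_t bytes = kN * sizeof(float);
    const size_t total = (size_t)kN * kStreams;

    float *hIn, *dA, *dB;
    // Pinned host memory, so cudaMemcpyAsync can overlap with kernels.
    CHECK(cudaMallocHost(&hIn, total * sizeof(float)));
    CHECK(cudaMalloc(&dA, total * sizeof(float)));
    CHECK(cudaMalloc(&dB, total * sizeof(float)));
    for (size_t i = 0; i < total; ++i) hIn[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s)
        CHECK(cudaStreamCreateWithFlags(&streams[s], cudaStreamNonBlocking));

    // Three enqueue calls per stream: 192 host-side launches in total.
    auto t0 = std::chrono::steady_clock::now();
    for (int s = 0; s < kStreams; ++s) {
        size_t off = (size_t)s * kN;
        CHECK(cudaMemcpyAsync(dA + off, hIn + off, bytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<(kN + 255) / 256, 256, 0, streams[s]>>>(dA + off, kN, 2.0f);
        // Second copy is H2D, as written in the post (assumed: next input).
        CHECK(cudaMemcpyAsync(dB + off, hIn + off, bytes,
                              cudaMemcpyHostToDevice, streams[s]));
    }
    auto t1 = std::chrono::steady_clock::now();
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    float first = 0.0f;
    CHECK(cudaMemcpy(&first, dA, sizeof(float), cudaMemcpyDeviceToHost));
    printf("host enqueue time: %.1f us, dA[0] = %f\n", us, first);

    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(dA));
    CHECK(cudaFree(dB));
    CHECK(cudaFreeHost(hIn));
    return 0;
}
```

If the second copy is really a D2H readback, only the direction and buffers of that one call change; the number of host-side launches stays the same.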
I tried a CUDA graph to reduce the launch time, but in the graph all the H2D operations are placed last, so there is no concurrency between the copies and the kernels.
Is there a good way to keep the concurrency and still reduce the launch time?
Thanks,
sky