How many cudaMemcpyAsync calls can run at the same time in their respective streams?

Typically that is a good way to think about it. For a given GPU, at most one H->D transfer and one D->H transfer, issued in separate streams, can overlap at any moment.
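To make that concrete, here is a minimal sketch of one H->D copy and one D->H copy issued in two separate streams. The buffer names, the 64 MiB size, and the `CHECK` macro are my own choices for illustration, not anything from the question. It assumes pinned host memory (`cudaMallocHost`), since copies from pageable memory generally won't overlap, and it prints `asyncEngineCount` as a rough indicator: a value of 1 means copies can overlap with kernels only, while 2 or more means the device can copy in both directions at once.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per buffer (arbitrary size)

    // Report how many copy engines device 0 exposes.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    // Pinned host buffers, so the async copies can actually be asynchronous.
    void *h_in = nullptr, *h_out = nullptr;
    CHECK(cudaMallocHost(&h_in, bytes));
    CHECK(cudaMallocHost(&h_out, bytes));
    memset(h_in, 1, bytes);

    void *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaMalloc(&d_in, bytes));
    CHECK(cudaMalloc(&d_out, bytes));
    CHECK(cudaMemset(d_out, 0, bytes));

    // One stream per transfer direction.
    cudaStream_t s_h2d, s_d2h;
    CHECK(cudaStreamCreate(&s_h2d));
    CHECK(cudaStreamCreate(&s_d2h));

    // These two copies go in opposite directions in different streams,
    // so they are allowed to overlap on a device with two copy engines.
    CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s_h2d));
    CHECK(cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s_d2h));

    // Wait for both streams to finish before touching the buffers.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(s_h2d));
    CHECK(cudaStreamDestroy(s_d2h));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return 0;
}
```

Whether the two copies really overlap is best confirmed in a profiler timeline (Nsight Systems, for example). If you add a second H->D copy in a third stream, I would expect it to queue behind the first one rather than run alongside it.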

Sometimes people ask why multiple transfers to the same device in the same direction cannot take place at the same time. That question has come up several times; I tried to answer it here, for example.