Multiple switches between host and device


I have a problem with the running time of my CUDA code.
My application's input is a data stream. At each time step I do the following:

  1. copy a pulse of data to the device (30 µs)
  2. do some processing (150 µs)
  3. copy the result back to the host (60 µs)

Each of the three steps above takes little time (about 240 µs in total), but when I measure the whole sequence it takes far longer (15 ms). Could you please let me know why this happens?

I guess a lot of time may be spent switching between host and device. Is this true? How can I avoid it?

Many thanks.

How are you measuring the time of each step?
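
For comparison, here is a minimal sketch of one way to time the three steps and the whole iteration with CUDA events. It is not the poster's code: the `process` kernel, the `time_pulse` helper, the buffer size and the launch configuration are all made-up placeholders. The `cudaFree(0)` warm-up is there because the first CUDA call in a program usually pays a one-time initialization cost, which is one common reason (though not confirmed for this case) that a total looks much larger than the sum of its parts.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(1); } } while (0)

// Hypothetical stand-in for "do some processing"; the real kernel is not shown.
__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Runs one pulse and fills ms[0..2] with copy-in, kernel and copy-out times,
// and ms[3] with the time for the whole sequence (all in milliseconds).
void time_pulse(const float* h_in, float* h_out, float* d_in, float* d_out,
                int n, float ms[4]) {
    cudaEvent_t ev[4];
    for (auto& e : ev) CHECK(cudaEventCreate(&e));
    size_t bytes = n * sizeof(float);

    CHECK(cudaEventRecord(ev[0]));
    CHECK(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(ev[1]));
    process<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(ev[2]));
    CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(ev[3]));
    CHECK(cudaEventSynchronize(ev[3]));  // wait until all recorded work is done

    for (int i = 0; i < 3; ++i)
        CHECK(cudaEventElapsedTime(&ms[i], ev[i], ev[i + 1]));
    CHECK(cudaEventElapsedTime(&ms[3], ev[0], ev[3]));
    for (auto& e : ev) CHECK(cudaEventDestroy(e));
}

int main() {
    const int n = 1 << 14;  // assumed pulse size, purely illustrative
    size_t bytes = n * sizeof(float);
    float *h_in, *h_out, *d_in, *d_out;

    CHECK(cudaFree(0));  // warm-up: pay the one-time initialization cost here
    CHECK(cudaMallocHost((void**)&h_in, bytes));   // pinned host memory
    CHECK(cudaMallocHost((void**)&h_out, bytes));
    CHECK(cudaMalloc((void**)&d_in, bytes));       // allocate once, reuse per pulse
    CHECK(cudaMalloc((void**)&d_out, bytes));
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    for (int pulse = 0; pulse < 5; ++pulse) {
        float ms[4];
        time_pulse(h_in, h_out, d_in, d_out, n, ms);
        printf("pulse %d: H2D %.3f ms, kernel %.3f ms, D2H %.3f ms, total %.3f ms\n",
               pulse, ms[0], ms[1], ms[2], ms[3]);
    }

    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return 0;
}
```

If the per-step numbers were taken with a host timer and no synchronization, they may not reflect the real GPU time, since kernel launches return before the kernel finishes. Comparing the first pulse against later ones in the output should also show whether start-up cost or per-pulse allocation is inflating the total.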