I have a problem with the running time of my CUDA code.
My application's input is a data stream. At each time step I do the following:
- copy a pulse of data to the device (30 micro sec)
- do some processing (150 micro sec)
- copy the result to the host (60 micro sec)
Each of the three steps above takes very little time, but when I measure the entire iteration it is much longer (15 milli sec). Could you please tell me why this happens?
I guess a lot of time may be consumed switching between host and device. Is this true? How can I avoid it?
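For reference, my per-iteration loop looks roughly like this (a simplified sketch: the kernel, buffer names, and sizes here are placeholders, not my actual application code):

```cuda
// Simplified sketch of one iteration of the streaming loop.
// The kernel body and buffer sizes are placeholders.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // stand-in for the real processing
}

int main() {
    const int n = 1 << 16;                 // placeholder pulse size
    const size_t bytes = n * sizeof(float);

    float *h_in  = (float*)malloc(bytes);
    float *h_out = (float*)malloc(bytes);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    auto t0 = std::chrono::high_resolution_clock::now();

    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // step 1 (~30 us)
    process<<<(n + 255) / 256, 256>>>(d_in, d_out, n);        // step 2 (~150 us)
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // step 3 (~60 us)
    cudaDeviceSynchronize();               // make sure all work has finished

    auto t1 = std::chrono::high_resolution_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    printf("one iteration: %lld us\n", us); // this total is far larger than 30+150+60

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}
```

The per-step numbers above come from timing each call individually; the 15 ms figure is the wall-clock time for the whole iteration measured as in this sketch.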