I am trying to use CUDA in a real-time hardware-in-the-loop simulation. My simulation time step is 50 usec. I have timed a single call to cuMemcpyDtoH, transferring the smallest possible amount of data, at 18 usec. Assuming that cuMemcpyHtoD takes the same amount of time, I will be spending 36 usec on memory transfers alone, leaving only 14 usec for kernel execution. However, my kernel requires approximately 25 usec, so a full step would take about 61 usec, roughly 11 usec over budget. I must find a way to reduce the memory transfer overhead. Any help would be appreciated.
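To make the budget explicit, here is a small sketch of the arithmetic in plain C. It only works with the numbers quoted above and makes no CUDA calls. The function names are made up for illustration, and it assumes the two copies and the kernel run strictly one after another with no overlap, and that both copies cost the same.

```c
#include <stdio.h>

/* Time left for the kernel in one step, in microseconds.
   Assumes the host-to-device copy, the kernel and the device-to-host
   copy run back to back. A negative result means the copies alone
   already exceed the step. */
double kernel_budget_us(double step_us, double h2d_us, double d2h_us)
{
    return step_us - (h2d_us + d2h_us);
}

/* Largest per-copy latency that still fits the kernel into the step,
   assuming both copies take the same time (an assumption, not a
   measurement). */
double max_copy_us(double step_us, double kernel_us)
{
    return (step_us - kernel_us) / 2.0;
}

/* Hypothetical helper: prints the budget for the figures in this post. */
void print_budget(void)
{
    const double step_us   = 50.0;  /* simulation time step          */
    const double copy_us   = 18.0;  /* measured cuMemcpyDtoH latency */
    const double kernel_us = 25.0;  /* approximate kernel time       */

    printf("kernel budget: %.1f usec\n",
           kernel_budget_us(step_us, copy_us, copy_us));
    printf("max per-copy latency: %.1f usec\n",
           max_copy_us(step_us, kernel_us));
}
```

With my numbers, calling print_budget() reports a kernel budget of 14.0 usec and a maximum per-copy latency of 12.5 usec, so each transfer would need to come down from 18 usec to about 12.5 usec or less for the step to fit.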