Is there any way to overlap host <-> device transfers, either with each other or with processing on the device? We’re trying to build a “real time” system, and the inability to overlap anything is killing our timing.
Roughly what I’ve got is:
12ms host -> device copy
75ms misc. processing
21ms device -> host copy
The data is already being “compressed” for both copies.
Is there any way to overlap any of that so that, if I ran the whole sequence back to back (or continuously), the average time per pass would be less than 108ms? I think I’ve read that we can’t, but even if I could just run the two copies simultaneously (the upload for one pass alongside the download for the previous one), that would cut the average by 12ms. Is this possible on the hardware? Is it possible in current CUDA?
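To make the numbers concrete, here is the back-of-the-envelope arithmetic as a small Python sketch. The function names are made up for illustration, and the two overlapped cases are hypothetical: they only show what the steady-state average per pass would be if that kind of overlap turned out to be possible, not that the hardware or CUDA supports it.

```python
# Stage timings from the list above, in milliseconds.
H2D_MS = 12   # host -> device copy
PROC_MS = 75  # misc. processing
D2H_MS = 21   # device -> host copy


def serial_ms(h2d, proc, d2h):
    """No overlap: every pass pays for all three stages back to back."""
    return h2d + proc + d2h


def copies_overlapped_ms(h2d, proc, d2h):
    """Hypothetical: the upload for one pass runs alongside the download
    for the previous pass, while processing overlaps with nothing."""
    return proc + max(h2d, d2h)


def fully_pipelined_ms(h2d, proc, d2h):
    """Hypothetical best case: all three stages of different passes run
    concurrently, so the slowest stage sets the average time per pass."""
    return max(h2d, proc, d2h)


if __name__ == "__main__":
    print("serial:            ", serial_ms(H2D_MS, PROC_MS, D2H_MS), "ms")
    print("copies overlapped: ", copies_overlapped_ms(H2D_MS, PROC_MS, D2H_MS), "ms")
    print("fully pipelined:   ", fully_pipelined_ms(H2D_MS, PROC_MS, D2H_MS), "ms")
```

With my timings that gives 108ms with no overlap, 96ms if only the two copies could overlap, and 75ms if the copies could also hide behind the processing.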