cudaEventElapsedTime doesn’t work when the two events come from different devices. This is probably not a fundamental limitation, because Nsight can build a timeline of the whole application across every device it used. So is there a good way to measure the time difference between two events on different devices programmatically (not in Nsight)?
(I tried some hackish solutions, such as having one device wait for an event on another and then recording its own event, so that I’d have two events occurring in close proximity, but the results seem poor.)
One option is to have a stream callback record a host timestamp at each point where you want timing measurements. Later, you can subtract the two timestamps.
Note that the callback is executed once all previous CUDA activity in that stream has completed, so it behaves similarly to event completion. Separate devices have separate streams, so each device gets its own callback and its own timestamp.
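A minimal sketch of that idea, in case it helps. It assumes cudaStreamAddCallback as the callback mechanism and std::chrono::steady_clock as the host timer; the names spin, recordTimestamp and CHECK are made up for illustration, and the second device falls back to device 0 if only one GPU is present. The timestamps are taken on the host, so they include some callback dispatch latency and are not device-accurate.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(err_));                          \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Placeholder workload: busy-wait for roughly `cycles` device clock ticks.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Runs on a host thread after all earlier work in the stream has finished.
// No CUDA API calls are allowed inside a stream callback.
static void CUDART_CB recordTimestamp(cudaStream_t, cudaError_t, void* userData) {
    *static_cast<Clock::time_point*>(userData) = Clock::now();
}

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 1) { fprintf(stderr, "no CUDA device\n"); return 1; }
    int devA = 0;
    int devB = (deviceCount > 1) ? 1 : 0;  // assumption: fall back to one GPU

    cudaStream_t streamA, streamB;
    Clock::time_point tA, tB;

    CHECK(cudaSetDevice(devA));
    CHECK(cudaStreamCreate(&streamA));
    spin<<<1, 1, 0, streamA>>>(1000000LL);
    CHECK(cudaStreamAddCallback(streamA, recordTimestamp, &tA, 0));

    CHECK(cudaSetDevice(devB));
    CHECK(cudaStreamCreate(&streamB));
    spin<<<1, 1, 0, streamB>>>(50000000LL);
    CHECK(cudaStreamAddCallback(streamB, recordTimestamp, &tB, 0));

    // Synchronizing a stream also waits for its pending callbacks.
    CHECK(cudaSetDevice(devA));
    CHECK(cudaStreamSynchronize(streamA));
    CHECK(cudaSetDevice(devB));
    CHECK(cudaStreamSynchronize(streamB));

    double ms = std::chrono::duration<double, std::milli>(tB - tA).count();
    printf("host-side gap between the two points: %.3f ms\n", ms);

    CHECK(cudaStreamDestroy(streamB));
    CHECK(cudaSetDevice(devA));
    CHECK(cudaStreamDestroy(streamA));
    return 0;
}
```

The runtime documentation marks cudaStreamAddCallback as slated for eventual deprecation; on CUDA 10 and later, cudaLaunchHostFunc plays the same role with a simpler callback signature.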
(BTW, when I wrote that one approach I tried seemed to produce poor results, I didn’t realize that there was a bug in my code stemming from a misunderstanding of the semantics of cudaStreamWaitEvent.)
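For reference, here is a rough sketch of the event-based scheme from the question. This is an illustration only, not the code discussed above, and the spin kernel and variable names are made up. It relies on one documented detail: cudaStreamWaitEvent waits on the most recent cudaEventRecord issued for that event before the wait call, and if the event has never been recorded the wait is a no-op, so the record has to be issued first. The elapsed time excludes whatever latency the cross-device wait itself adds.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder workload: busy-wait for roughly `cycles` device clock ticks.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 1) { fprintf(stderr, "no CUDA device\n"); return 1; }
    int dev0 = 0;
    int dev1 = (deviceCount > 1) ? 1 : 0;  // assumption: fall back to one GPU

    cudaStream_t s0, s1;
    cudaEvent_t start0;          // lives on dev0
    cudaEvent_t proxy1, end1;    // live on dev1

    cudaSetDevice(dev0);
    cudaStreamCreate(&s0);
    cudaEventCreate(&start0);

    cudaSetDevice(dev1);
    cudaStreamCreate(&s1);
    cudaEventCreate(&proxy1);
    cudaEventCreate(&end1);

    // 1. Record the dev0 event BEFORE any stream is asked to wait on it.
    cudaSetDevice(dev0);
    spin<<<1, 1, 0, s0>>>(1000000LL);
    cudaEventRecord(start0, s0);

    // 2. Make the dev1 stream wait for it, then record a dev1 event right
    //    after the wait. proxy1 stands in for start0 on dev1's timeline.
    cudaSetDevice(dev1);
    cudaStreamWaitEvent(s1, start0, 0);
    cudaEventRecord(proxy1, s1);

    // 3. Do the dev1 work and record the end point on the same device.
    spin<<<1, 1, 0, s1>>>(50000000LL);
    cudaEventRecord(end1, s1);
    cudaEventSynchronize(end1);

    // Both events belong to dev1, so cudaEventElapsedTime accepts them.
    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, proxy1, end1);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaEventElapsedTime: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("approx. time from start0 (dev0) to end1 (dev1): %.3f ms\n", ms);
    return 0;
}
```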