I thought I had the CUDA timing intricacies sorted, but I think I’m missing something. I was doing a pinned cudaMemcpy (vanilla, not async) and timing it with Windows' QueryPerformanceCounter. The measured times were far too long and varied widely from run to run. So I went back to the SDK sample and noticed that, when copying pinned memory, it uses cudaMemcpyAsync on the default stream and times it with cudaEvents. It only uses QueryPerformanceCounter for non-pinned transfers.
So two questions from this:
1: My understanding is that a cudaMemcpyAsync on the default stream is functionally equivalent to a (synchronous) cudaMemcpy, so why does the sample use it at all?
2: Is this the memcpy timing rule: synchronous transfers get timed with a host timer (QueryPerformanceCounter on Windows), and everything else with CUDA events?
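
For reference, here is a minimal sketch of the two timing patterns I am comparing. It is not the SDK sample's actual code: the 32 MiB buffer size, the variable names, and the CHECK macro are my own placeholders, and std::chrono::steady_clock stands in for QueryPerformanceCounter so the sketch is not Windows-only.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 32u << 20;  // 32 MiB, arbitrary size for illustration
    void* h_pinned = nullptr;
    void* d_buf = nullptr;
    CHECK(cudaMallocHost(&h_pinned, bytes));  // pinned (page-locked) host memory
    CHECK(cudaMalloc(&d_buf, bytes));

    // Untimed warm-up copy so one-time setup costs stay out of both measurements.
    CHECK(cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaDeviceSynchronize());

    // Pattern A: host wall-clock timer around a blocking cudaMemcpy.
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaDeviceSynchronize());  // make sure the copy has finished
    auto t1 = std::chrono::steady_clock::now();
    double host_ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();

    // Pattern B: CUDA events around cudaMemcpyAsync on the default stream (0).
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, 0));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));  // wait until the stop event is reached
    float event_ms = 0.0f;
    CHECK(cudaEventElapsedTime(&event_ms, start, stop));

    std::printf("host timer: %.3f ms, CUDA events: %.3f ms\n",
                host_ms, event_ms);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_pinned));
    return 0;
}
```

Built with nvcc and run a few times, this prints both measurements side by side, which makes it easy to see whether the run-to-run variation shows up only in the host-timer number.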