I have an app that sends up 12288 bytes from each of 50-odd tiles (the transfers can't easily be consolidated). At around 2.5 GB/s I'd expect roughly 6 microseconds for each transfer, and that is indeed what Nsight reports. But if I wrap the transfers in Windows QueryPerformanceCounter calls, I get a consistent 10 ms for the same size of transfer.
I blamed QueryPerformanceCounter at first, since we are getting close to its resolution limit, but a second pair of QueryPerformanceCounter calls outside the .cu file, wrapping all the tiles (and so measuring at a reasonable resolution), agrees that it really is taking that long.
Am I missing something obvious here?