I have an app that sends up 12288 bytes from each of 50-odd tiles (that can’t be consolidated easily). At around 2.5GB/s I’d expect, say, 6 microseconds for each transfer, and that is indeed what Nsight reports. But if I wrap the transfers in Windows QueryPerformanceCounter calls, I get a consistent 10ms for the same size transfer.
I blamed QueryPerformanceCounter at first, since we are getting close to its resolution limit, but a second pair of QueryPerformanceCounter calls outside the .co, wrapping all the tiles (and running at a reasonable resolution), agrees that it really is taking that long.
Am I missing something obvious here?
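For reference, the back-of-envelope estimate above can be written out as a minimal C++ sketch. It assumes 1 GB/s means 1e9 bytes per second and ignores any fixed per-call cost; the function name `expected_transfer_us` is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Ideal transfer time in microseconds for a given payload and bandwidth.
// Assumption: 1 GB/s = 1e9 bytes per second, and no fixed per-call overhead.
double expected_transfer_us(std::size_t bytes, double gigabytes_per_s) {
    assert(gigabytes_per_s > 0.0);
    const double seconds = static_cast<double>(bytes) / (gigabytes_per_s * 1e9);
    return seconds * 1e6;  // seconds -> microseconds
}
```

Calling `expected_transfer_us(12288, 2.5)` gives roughly 4.9 microseconds, the same ballpark as the figure quoted above and several orders of magnitude below 10ms.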
@sedona, is it possible for you to post a reproducible example? I think you may be timing some additional operations within the QueryPerformanceCounter range. Note that the Nsight values are only the GPU timing. In the timeline you should be able to select the range from the first CPU call to the last CPU call to determine the CPU time. Nsight uses either QueryPerformanceCounter or RDTSC to collect GPU timestamps.
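As a rough illustration of what a CPU-side range measures, here is a minimal C++ sketch that times a whole batch of calls and reports the mean per call. It uses `std::chrono::steady_clock` as a portable stand-in for QueryPerformanceCounter; `mean_call_time_us` and `op` are hypothetical names, and `op` would be whatever call is being timed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Mean wall-clock time per call, in microseconds, as seen from the CPU.
// Everything executed inside the loop is included in the measurement,
// not just the work the GPU performs.
template <typename Op>
double mean_call_time_us(Op op, std::size_t repetitions) {
    assert(repetitions > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < repetitions; ++i) {
        op();  // hypothetical stand-in for one transfer call
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(repetitions);
}
```

Timing the batch as a whole keeps the timer's own cost and resolution from dominating a measurement of a few microseconds. If the mean is still far above the GPU-only figure, the difference is being spent on the CPU side of the call.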
Here’s my test harness:
StopWatchWin timer; // cuda 5 timer class, uses QueryPerformanceFrequency
const size_t dataSize = 50000;
cpu_output_buffer_test = (int *)malloc(dataSize);
tC = timer.getTime();
// (the device-to-host copy into cpu_output_buffer_test is timed here)
tD = timer.getTime();
float transferDownMs = tD-tC;
printf(" timingDown = %.04f\n",transferDownMs);
Running this (and repeating it 35 times), Nsight reports an average of about 9 microseconds to transfer 50000 bytes (Quadro 2000), which makes sense at a 5GB/s transfer rate. The program, though, consistently reports about 140 microseconds. Any insights very welcome!
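To put the gap in numbers, here is a small C++ sketch of the arithmetic. It assumes the CPU-measured time is simply the GPU copy time plus some fixed per-call cost, and takes 1 MB as 1e6 bytes; `TransferBreakdown` and `break_down` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>

struct TransferBreakdown {
    double overhead_us;         // CPU-measured time not explained by the GPU copy
    double effective_mb_per_s;  // throughput implied by the CPU-measured time
};

// Split a CPU-side measurement into GPU copy time and everything else.
// Assumption: cpu_us = gpu_us + per-call overhead.
TransferBreakdown break_down(std::size_t bytes, double cpu_us, double gpu_us) {
    assert(cpu_us > 0.0 && gpu_us >= 0.0);
    TransferBreakdown b;
    b.overhead_us = cpu_us - gpu_us;
    // Bytes per microsecond is numerically equal to MB/s when 1 MB = 1e6 bytes.
    b.effective_mb_per_s = static_cast<double>(bytes) / cpu_us;
    return b;
}
```

With the figures above, `break_down(50000, 140.0, 9.0)` gives about 131 microseconds of overhead per call and an effective rate of roughly 357 MB/s, so nearly all of the measured time lies outside the copy itself.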