Memory copy latency and kernel launch overhead

I used the following piece of code to measure memory copy latency and kernel launch overhead on a GTX 580. I see latencies of 7.4 usec and 9.5 usec for host-to-device and device-to-host memory copies respectively, and a 4.5 usec overhead for launching an empty kernel. Are these in line with the expected values? Am I doing something wrong? Where can I find a good reference for some standardised testing methodologies and results? Thank you!


#include <iostream>
#include <cstdlib>
#include <cstring>
#define NDATA 10
#define NLOOP 100000

__global__ void doNothing() {
}

int main(void) {
float *x;
x = (float *)malloc(NDATA * sizeof(float));
memset(x, 0, NDATA * sizeof(float));
float *dev_x; cudaMalloc(&dev_x, NDATA * sizeof(float));
float host_to_device_time = 0.0f, device_to_host_time = 0.0f, delta_time, kernel_overhead_time = 0.0f;

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int i = 0; i < NLOOP; i++) {
	cudaEventRecord(start, 0);
	cudaMemcpy(dev_x, x, NDATA * sizeof(float), cudaMemcpyHostToDevice);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&delta_time, start, stop);
	host_to_device_time += delta_time;
}
host_to_device_time /= NLOOP;

for (int i = 0; i < NLOOP; i++) {
	cudaEventRecord(start, 0);
	cudaMemcpy(x, dev_x, NDATA * sizeof(float), cudaMemcpyDeviceToHost);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&delta_time, start, stop);
	device_to_host_time += delta_time;
}
device_to_host_time /= NLOOP;

for (int i = 0; i < NLOOP; i++) {
	cudaEventRecord(start, 0);
	doNothing<<<1,1>>>();
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&delta_time, start, stop);
	kernel_overhead_time += delta_time;		
}
kernel_overhead_time /= NLOOP;

float data_MB = (1.0f * NDATA * sizeof(float)) / (1024 * 1024);

std::cout << "Host to Device:\n";
std::cout << "\tTotal Data transferred: " << data_MB << "MB\n";
std::cout << "\tTotal Time: " << host_to_device_time << "msec\n";
std::cout << "\tTransfer rate: " << 1.024f * data_MB / host_to_device_time << "GB/sec\n";
std::cout << "Device to Host:\n";
std::cout << "\tTotal Data transferred: " << data_MB << "MB\n";
std::cout << "\tTotal Time: " << device_to_host_time << "msec\n";
std::cout << "\tTransfer rate: " << 1.024f * data_MB / device_to_host_time << "GB/sec\n";
std::cout << "Kernel overhead:\n";
std::cout << "\t" << 1000 * kernel_overhead_time << "usec\n";

cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaFree(dev_x);
free(x);

return 0;

}
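
One caveat with the loops above: each sample also pays for the two cudaEventRecord calls, and synchronizing after every single launch serializes everything. As a cross-check, here is a minimal sketch (the name nLaunches and the batch size are just illustrative) that queues a batch of empty launches, waits once with cudaDeviceSynchronize, and divides the host-side elapsed time by the batch size. Note this gives the amortized cost of back-to-back launches (closer to launch throughput) rather than the latency of a single synchronized launch, so the two numbers need not match exactly.

#include <chrono>
#include <iostream>

__global__ void doNothing() {
}

int main(void) {
	const int nLaunches = 100000;   // batch size; same order of magnitude as NLOOP

	doNothing<<<1,1>>>();           // warm-up launch (context setup, module load)
	cudaDeviceSynchronize();

	auto t0 = std::chrono::high_resolution_clock::now();
	for (int i = 0; i < nLaunches; i++) {
		doNothing<<<1,1>>>();       // queue empty launches back to back
	}
	cudaDeviceSynchronize();        // wait for the whole batch to drain
	auto t1 = std::chrono::high_resolution_clock::now();

	double usec = std::chrono::duration<double, std::micro>(t1 - t0).count();
	std::cout << "Average cost per launch: " << usec / nLaunches << " usec\n";
	return 0;
}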

You are doing it right.

The overhead could be due to the WDDM (Windows Display Driver Model).

Are you using Vista/Win 7?

I don’t see an easy fix for this, but you could use a Tesla card to enable the TCC driver and bypass WDDM, or change the OS (Linux/Win XP).

Actually, I’m using a Linux box with Ubuntu installed. The reason I’m skeptical is that I read in another forum post that the expected latencies are around 500 clock cycles, which for a 1.544 GHz processor translates into 0.3 usec. My measured latencies are 1–2 orders of magnitude larger. Surely NVIDIA must have some empirical numbers on these; I just haven’t been able to find the source yet…
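
For reference, the arithmetic behind that 0.3 usec figure, using the quoted 1.544 GHz shader clock:

$$ \frac{500\ \text{cycles}}{1.544\times10^{9}\ \text{cycles/s}} \approx 0.32\ \mu\text{s} $$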

I also tested an older card, an 8800 GTX, and got roughly double the numbers above: 15 usec (host-to-device), 24 usec (device-to-host), and 9 usec (kernel launch overhead).

The numbers you get for launch overhead are reasonable for the GTX 580 and the 8800 GTX in Linux. Fermi cards have less overhead than older devices.

I believe 500 shader clock cycles is roughly the latency of reading device memory from inside a kernel, not the cost of a host-side copy or kernel launch. (300 clock cycles is the number I recall from some time ago.)
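
If you want to see that in-kernel number directly, the usual trick is a dependent pointer chase timed with the device-side clock64() intrinsic (compute capability 2.0+, so it works on the GTX 580). Below is a minimal sketch along those lines; the kernel name, chain length, stride, and iteration count are all illustrative, and on Fermi the result is affected by L1/L2 caching unless the stride and footprint are chosen to defeat them.

#include <cstdio>
#include <cstdlib>

#define CHAIN_LEN (1 << 20)   // elements in the chase array (illustrative)
#define STRIDE    4099        // odd stride so consecutive reads land far apart
#define ITERS     2048        // dependent loads to time

// Each load depends on the previous one, so the timed loop takes roughly
// ITERS * (memory read latency) shader clock cycles.
__global__ void chaseKernel(const unsigned int *chain, unsigned int *sink,
                            long long *cycles) {
	unsigned int idx = 0;
	long long t0 = clock64();
	for (int i = 0; i < ITERS; i++) {
		idx = chain[idx];
	}
	long long t1 = clock64();
	*sink = idx;                       // keep the loads from being optimized away
	*cycles = (t1 - t0) / ITERS;       // rough cycles per dependent read
}

int main(void) {
	unsigned int *h_chain = (unsigned int *)malloc(CHAIN_LEN * sizeof(unsigned int));
	for (unsigned int i = 0; i < CHAIN_LEN; i++)
		h_chain[i] = (i + STRIDE) % CHAIN_LEN;   // build the pointer chain

	unsigned int *d_chain, *d_sink;
	long long *d_cycles;
	cudaMalloc(&d_chain, CHAIN_LEN * sizeof(unsigned int));
	cudaMalloc(&d_sink, sizeof(unsigned int));
	cudaMalloc(&d_cycles, sizeof(long long));
	cudaMemcpy(d_chain, h_chain, CHAIN_LEN * sizeof(unsigned int), cudaMemcpyHostToDevice);

	chaseKernel<<<1,1>>>(d_chain, d_sink, d_cycles);

	long long cycles = 0;
	cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
	std::printf("~%lld shader clock cycles per dependent read\n", cycles);

	cudaFree(d_chain); cudaFree(d_sink); cudaFree(d_cycles);
	free(h_chain);
	return 0;
}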